A human body image matching method and device

By performing cross-layer bidirectional inter-encoding of self-attention information on human images through segmentation and grouping, feature vectors and pose values are extracted, solving the problem of insufficient matching accuracy of human images under different poses and achieving more flexible and accurate matching results.

CN115272721B · Active · Publication Date: 2026-06-16 · BEIJING ZHIDA TIANJIE COMMERCIAL OPERATION MANAGEMENT CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING ZHIDA TIANJIE COMMERCIAL OPERATION MANAGEMENT CO LTD
Filing Date: 2022-07-29
Publication Date: 2026-06-16

Application Information

IPC: G06V10/74; G06V40/10

CPC: G06V10/761; G06V40/103

AI Tagging

Application Domain

Biometric pattern recognition

Technical Efficacy Phrases

Matching results are flexible and precise
Flexible and precise determination

AI Technical Summary

Technical Problem

Existing technologies struggle to maintain high accuracy when matching human images across large variations in pose, posture, and angle.

Method used

By extracting the segmented group self-attention information of the human images to be matched, and performing cross-layer bidirectional inter-encoding, human feature vectors and pose values are obtained. These features and values are then used to determine the similarity between the two images.

Benefits of technology

It achieves flexibility and accuracy in matching human images under different poses, and improves the accuracy of matching results.

Abstract

The present disclosure provides a human body image matching method and device. The method extracts a human body feature vector and a pose value of each to-be-matched human body image, and determines the similarity of two to-be-matched human body images according to the human body feature vector and the pose value of each to-be-matched human body image. That is, the similarity of the two to-be-matched human body images in the method is determined according to the human body feature vector and the pose value of each to-be-matched human body image. Therefore, the determination manner of the similarity of the two to-be-matched human body images can be used for human body image matching under different poses, which makes the determination manner of the similarity of the two to-be-matched human body images more flexible and accurate, and thus the human body image matching result determined according to the similarity is more accurate.

Description

Technical Field

[0001] This disclosure relates to the field of image processing technology, and in particular to a method and apparatus for human body image matching.

Background Technology

[0002] Human image recognition algorithms typically aim to overcome the visual limitations of fixed cameras and can be combined with pedestrian detection / tracking technologies. They are widely used in video surveillance, security, and other fields. Generally, multiple human images are acquired and compared to determine whether they belong to the same user. However, existing human image comparison and retrieval techniques often extract only a single fixed feature from each image. Such a fixed feature adapts poorly to the varied conditions of real-world comparisons, such as images of people with different poses, postures, and angles, and accuracy drops significantly as a result. Therefore, a new human image matching method is needed.

Summary of the Invention

[0003] In view of this, the present disclosure provides a human image matching method, apparatus, computer device, and computer-readable storage medium to solve the problem of inaccurate human image matching results when the facial pose changes rapidly and drastically in the prior art.

[0004] A first aspect of this disclosure provides a human image matching method, the method comprising:

[0005] Obtain two human images to be matched;

[0006] For each human image to be matched, the segmented grouping self-attention information of the human image to be matched is determined based on the human image to be matched; the segmented grouping self-attention information of the human image to be matched is fused and encoded to obtain the human feature vector and pose value of the human image to be matched.

[0007] The similarity between the two human images to be matched is determined based on their respective human feature vectors and pose values.

[0008] The matching result of the two human body images to be matched is determined based on their similarity.

[0009] A second aspect of this disclosure provides a human body image matching apparatus, the apparatus comprising:

[0010] The image acquisition unit is used to acquire two human images to be matched.

[0011] The information acquisition unit is used to determine the segmented grouping self-attention information of each human image to be matched based on the human image to be matched; and to perform fusion encoding processing on the segmented grouping self-attention information of the human image to be matched to obtain the human feature vector and pose value of the human image to be matched.

[0012] The similarity determination unit is used to determine the similarity between the two human images to be matched based on the human feature vectors and pose values corresponding to the two human images to be matched, respectively.

[0013] The result determination unit is used to determine the matching result of the two human images to be matched based on their similarity.

[0014] A third aspect of this disclosure provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method described above.

[0015] A fourth aspect of this disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.

[0016] Compared with the prior art, this embodiment has the following beneficial effects. It first determines the segmented grouping self-attention information of each human image to be matched and performs fusion encoding on that information to obtain the human feature vector and pose value of the image. It then determines the similarity between the two human images to be matched from their respective human feature vectors and pose values, and determines the matching result from that similarity. Because the similarity is determined from both the human feature vector and the pose value of each image, the method can be used to match human images under different poses. The similarity determination is therefore more flexible and accurate, which in turn makes the matching result determined from the similarity more accurate.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of this disclosure, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure;

[0019] Figure 2 is a flowchart of the human image matching method provided in the embodiments of this disclosure;

[0020] Figure 3 is a block diagram of the human image matching device provided in the embodiments of this disclosure;

[0021] Figure 4 is a schematic diagram of a computer device provided in an embodiment of this disclosure.

Detailed Implementation

[0022] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, so as to provide a thorough understanding of the embodiments of this disclosure. However, those skilled in the art will understand that this disclosure may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this disclosure with unnecessary detail.

[0023] A human image matching method and apparatus according to embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings.

[0024] In existing technologies, human image matching and retrieval processes often extract only a single fixed feature from a human image. Such a fixed feature adapts poorly to the varied conditions of complex real-world comparisons. For example, comparing human images with large differences in pose, posture, and angle results in a significant drop in accuracy. In other words, the accuracy of current human image matching methods still falls far short of the requirements of practical applications, and further research and improvement are needed.

[0025] To address the aforementioned problems, this invention provides a human image matching method. In this method, the human feature vector and pose value of each human image to be matched are extracted, and the similarity between two human images to be matched is determined based on these parameters. In other words, the similarity between two human images to be matched in this method is determined based on the human feature vector and pose value of each image. Therefore, the method for determining the similarity between two human images to be matched can be used for human image matching under different poses. This makes the method for determining the similarity between two human images to be matched more flexible and accurate, thereby making the human image matching result determined based on this similarity more precise.

[0026] For example, embodiments of the present invention can be applied to the application scenario shown in Figure 1, which can include terminal device 1 and server 2.

[0027] Terminal device 1 can be hardware or software. When terminal device 1 is hardware, it can be various electronic devices with image acquisition capabilities and supporting communication with server 2, including but not limited to smartphones, tablets, laptops, and desktop computers; when terminal device 1 is software, it can be installed in the aforementioned electronic devices. Terminal device 1 can be implemented as multiple software programs or software modules, or as a single software program or software module; this embodiment of the disclosure does not impose any limitations on this. Server 2 can be a server that provides various services, such as a backend server that receives requests sent by terminal devices with which it has established communication connections. This backend server can receive and analyze the requests sent by the terminal devices and generate processing results. Server 2 can be a single server, a server cluster consisting of several servers, or a cloud computing service center; this embodiment of the disclosure does not impose any limitations on this.

[0028] It should be noted that server 2 can be either hardware or software. When server 2 is hardware, it can be various electronic devices that provide various services to terminal device 1. When server 2 is software, it can be multiple software programs or software modules that provide various services to terminal device 1, or it can be a single software program or software module that provides various services to terminal device 1. This disclosure does not impose any limitations on this aspect.

[0029] Terminal device 1 and server 2 can communicate via a network. The network can be a wired network using coaxial cable, twisted pair, or fiber optic connection, or a wireless network that enables interconnection of various communication devices without wiring, such as Bluetooth, Near Field Communication (NFC), or Infrared. The embodiments of this disclosure do not impose any limitations on this.

[0030] Specifically, a user can input two human images to be matched via terminal device 1, which then sends them to server 2. Server 2 first determines the segmented grouping self-attention information for each image based on its own characteristics. Then, it performs fusion encoding on this segmented grouping self-attention information to obtain the human feature vector and pose value of the image. Next, it determines the similarity between the two images based on their respective human feature vectors and pose values. Finally, it determines the matching result based on this similarity. Server 2 returns the matching result to terminal device 1, allowing terminal device 1 to display the result to the user. This method makes the determination of the similarity between the two images more flexible and precise, resulting in a more accurate matching result.

[0031] It should be noted that the specific types, quantities, and combinations of terminal device 1, server 2, and network can be adjusted according to the actual needs of the application scenario, and this disclosure embodiment does not impose any restrictions on this.

[0032] It should be noted that the above application scenarios are shown only for the purpose of understanding this disclosure, and the implementation of this disclosure is not limited in any way. On the contrary, the implementation of this disclosure can be applied to any applicable scenario.

[0033] Figure 2 is a flowchart of a human image matching method provided in an embodiment of this disclosure. The human image matching method of Figure 2 can be executed by the terminal device or the server of Figure 1. As shown in Figure 2, the human image matching method includes:

[0034] S201: Obtain two human body images to be matched.

[0035] In this embodiment, the human images to be matched can be understood as images for which human image matching is required. As an example, the two human images to be matched can be captured by a surveillance camera installed in a fixed location, captured by a mobile terminal device, or read from a storage device that pre-stores images. It should be noted that, in one implementation, the two human images to be matched can be two video frames extracted from a video.

[0036] S202: For each human image to be matched, determine the segmented grouping self-attention information of the human image to be matched; perform fusion encoding processing on the segmented grouping self-attention information of the human image to be matched to obtain the human feature vector and pose value of the human image to be matched.

[0037] In this embodiment, after obtaining two human images to be matched, the human feature vector and pose value of each human image to be matched can be extracted first, so that the similarity between the two human images to be matched can be determined by using the human feature vector and pose value of each human image to be matched.

[0038] It should be noted that the human feature vector of the human image to be matched can be understood as feature information that reflects the location of the human image region in the human image to be matched. In other words, the human feature vector can be used to identify the human image region in the human image to be matched. The pose value of the human image to be matched can be understood as information that reflects the human pose (i.e., human posture) in the human image to be matched. In other words, the pose value can be used to identify the human posture in the human image to be matched.

[0039] As an example, the method for determining the human feature vector and pose value of the human image to be matched is described below. First, based on the human image to be matched, the segmented grouping self-attention information of the human image to be matched can be determined. That is, segmented grouping self-attention processing can be performed on the human image to be matched to obtain the segmented grouping self-attention information, thus realizing cross-layer encoding from shallow to deep. Then, the segmented grouping self-attention information of the human image to be matched can be fused and encoded to obtain the human feature vector and pose value of the human image to be matched. This realizes the reversal of the fusion direction and performs fusion encoding from deep to shallow. It can be seen that in this embodiment, by performing cross-layer bidirectional inter-encoding (i.e., cross-layer encoding from shallow to deep and fusion encoding from deep to shallow) on the human image to be matched, the human feature vector and pose value of the human image to be matched can be obtained.

[0040] S203: Determine the similarity between the two human images to be matched based on their respective human feature vectors and pose values.

[0041] In this embodiment, after determining the corresponding human feature vector and pose value for each human image to be matched, the dot product value between the two images can be determined using their respective human feature vectors, and the weight value can be determined using their respective pose values. It is understood that the larger the absolute difference between the pose values of the two images, the less similar their poses are, and therefore, the smaller the corresponding weight value; conversely, the smaller the absolute difference between the pose values, the more similar their poses are, and therefore, the larger the corresponding weight value. Then, the similarity between the two images can be determined using the dot product value and the weight value.

[0042] S204: Based on the similarity between the two human body images to be matched, determine the matching result between the two human body images to be matched.

[0043] Understandably, the greater the similarity between two images of people to be matched, the higher the probability that the images correspond to the same user; conversely, the lower the similarity, the lower the probability that the images correspond to the same user. Therefore, in this embodiment, a preset similarity threshold can be set. If the similarity between the two images is equal to or greater than the preset similarity threshold, the matching result is determined to be that the two images belong to the same user; if the similarity is less than the preset similarity threshold, the matching result is determined to be that the two images belong to different users.
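
Steps S203 and S204 can be sketched together as follows. This is a minimal illustration, not the patented implementation: the text only requires that the weight shrink as the absolute pose difference grows, so the exponential form of `pose_weight`, its `scale` parameter, the default `threshold`, and all function names are assumptions chosen for the example.

```python
import math

def pose_weight(pose_a: float, pose_b: float, scale: float = 1.0) -> float:
    """Hypothetical weight: 1 for identical poses, decaying as they diverge."""
    return math.exp(-abs(pose_a - pose_b) / scale)

def similarity(feat_a, feat_b, pose_a, pose_b, scale=1.0) -> float:
    """S203: dot product of the feature vectors, scaled by the pose weight."""
    dot = sum(x * y for x, y in zip(feat_a, feat_b))
    return dot * pose_weight(pose_a, pose_b, scale)

def is_same_person(sim: float, threshold: float = 0.5) -> bool:
    """S204: the images match when similarity reaches the preset threshold."""
    return sim >= threshold

# Same (made-up) feature vectors, compared under similar and dissimilar poses.
feat_a = [0.6, 0.8]
feat_b = [0.6, 0.8]
close = similarity(feat_a, feat_b, pose_a=0.10, pose_b=0.15)
far = similarity(feat_a, feat_b, pose_a=0.10, pose_b=2.10)
print(close > far, is_same_person(close), is_same_person(far))
```

With identical feature vectors, the pair with the larger pose difference receives the smaller weight and therefore the lower similarity, which is the monotonic behavior paragraph [0041] describes.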

[0044] Compared with the prior art, this embodiment has the following beneficial effects. It first determines the segmented grouping self-attention information of each human image to be matched and performs fusion encoding on that information to obtain the human feature vector and pose value of the image. It then determines the similarity between the two human images to be matched from their respective human feature vectors and pose values, and determines the matching result from that similarity. Because the similarity is determined from both the human feature vector and the pose value of each image, the method can be used to match human images under different poses. The similarity determination is therefore more flexible and accurate, which in turn makes the matching result determined from the similarity more accurate.

[0045] Next, we will introduce one implementation method of "determining the segmented grouping self-attention information of the human image to be matched based on the human image to be matched" in S202. That is, in this embodiment, the step of determining the segmented grouping self-attention information of the human image to be matched based on the human image to be matched may include the following steps:

[0046] S202a: Input the human image to be matched into the feature map extraction module in the cross-layer bidirectional inter-coding network to obtain the first feature map, the second feature map and the third feature map.

[0047] In this embodiment, the method can be applied to a cross-layer bidirectional inter-coding network, wherein the cross-layer bidirectional inter-coding network may include a feature map extraction module, which can be used to extract feature maps from the human image to be matched. In one implementation, the cross-layer bidirectional inter-coding network can be a residual neural network (ResNet).

[0048] As an example, the feature map extraction module may include four downsampling layers connected in sequence (i.e., cascaded). In this embodiment, after inputting the human image to be matched into the feature map extraction module in the cross-layer bidirectional inter-coding network, the feature maps output by the second, third, and fourth downsampling layers in the feature map extraction module can be used as the first feature map, the second feature map, and the third feature map, respectively. For example, assuming the resolution of the human image to be matched is (3, 384, 128), where the three parameters correspond to the number of channels, height, and width, the first feature map M1: (128, 48, 16), the second feature map M2: (256, 24, 8), and the third feature map M3: (512, 12, 4) output by the second, third, and fourth downsampling layers in the feature map extraction module are obtained.
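
The shapes in this example can be checked with a little arithmetic. The sketch below assumes a ResNet-like layout in which the first downsampling layer reduces height and width by 4 and each later layer by 2, and in which the first layer outputs 64 channels; those strides and the 64-channel first layer are assumptions, and only the shapes of M1, M2, and M3 come from the text.

```python
def feature_map_shapes(image_shape, channels=(64, 128, 256, 512),
                       strides=(4, 2, 2, 2)):
    """Return the (C, H, W) output shape of each cascaded downsampling layer.

    `channels` and `strides` are assumed values, not taken from the patent.
    """
    _, height, width = image_shape
    shapes = []
    for out_channels, stride in zip(channels, strides):
        height //= stride
        width //= stride
        shapes.append((out_channels, height, width))
    return shapes

shapes = feature_map_shapes((3, 384, 128))
m1, m2, m3 = shapes[1:]  # outputs of the 2nd, 3rd and 4th layers
print(m1, m2, m3)  # (128, 48, 16) (256, 24, 8) (512, 12, 4)
```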

[0049] S202b: Input the first feature map, the second feature map, and the third feature map into the segmented grouping self-attention function in the cross-layer bidirectional inter-coding network to obtain the segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, respectively.

[0050] In this embodiment, the cross-layer bidirectional inter-coding network may include a segmented grouping self-attention function (i.e., an SSA function). The first feature map, the second feature map, and the third feature map can be input into this function to obtain their respective segmented grouping self-attention maps.

[0051] Next, we will introduce how the SSA function performs segmented grouping self-attention processing on the feature map to obtain the segmented grouping self-attention map corresponding to the feature map.

[0052] In this embodiment, the first feature map, the second feature map, and the third feature map are respectively used as target feature maps and input into the SSA function.

[0053] In one implementation, the target feature map is first input into the segmented grouping self-attention function in the cross-layer bidirectional inter-encoding network. The function splits the target feature map into groups along its second dimension, yielding several groups of segmented grouping self-attention feature vectors. For example, M1 has dimensions (128, 48, 16), so its second dimension is 48. Every 4 cells along the second dimension form one group (i.e., a sub-image region of the feature map), giving 12 sub-image regions in total. Each sub-image region has dimensions (128, 4, 16) and serves as one group of segmented grouping self-attention feature vectors of the target feature map.
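
The splitting of M1 along its second dimension can be sketched with NumPy as below. The function name `split_into_row_groups` and the choice to return the groups along a new leading axis are assumptions made for illustration; the text only fixes the group size and the resulting (128, 4, 16) shape.

```python
import numpy as np

def split_into_row_groups(feature_map: np.ndarray, rows_per_group: int) -> np.ndarray:
    """Split a (C, H, W) map along H into groups of shape (C, rows_per_group, W)."""
    channels, height, width = feature_map.shape
    if height % rows_per_group != 0:
        raise ValueError("height must be divisible by rows_per_group")
    num_groups = height // rows_per_group
    grouped = feature_map.reshape(channels, num_groups, rows_per_group, width)
    # Move the group axis to the front: (num_groups, C, rows_per_group, W).
    return grouped.transpose(1, 0, 2, 3)

# Dummy M1 filled with distinct values so the grouping can be inspected.
m1 = np.arange(128 * 48 * 16, dtype=np.float32).reshape(128, 48, 16)
groups = split_into_row_groups(m1, rows_per_group=4)
print(groups.shape)  # (12, 128, 4, 16)
```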

[0054] Next, for each group of segmented grouping self-attention feature vectors, the segmented grouping self-attention function computes a segmented grouping self-attention map. Specifically, the function first flattens the second and third dimensions of the group to obtain its segmented features; the same operation is applied to every group. For example, for a group with dimensions (128, 4, 16), flattening the second dimension (4) and the third dimension (16) yields 4 * 16 = 64 vectors of 128 dimensions each, so the segmented features of this group are A (64, 128). The function then determines the first-order transformation segmented features of the group from the segmented features and a preset first learnable parameter matrix, for example using the following formula.

[0055] B = A * W1

[0056] Where A is the segmented feature of the segmented group self-attention feature vector; W1 is the first learnable parameter matrix with dimensions (128, 384); and B is the first-order transformation segmented feature with dimensions (64, 384).
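
The flattening and the projection B = A * W1 can be sketched as follows. The random values stand in for a real feature group and a trained W1, and the transpose that places the 64 positions on the first axis is an assumption made so that A has the stated shape (64, 128).

```python
import numpy as np

rng = np.random.default_rng(0)
group = rng.standard_normal((128, 4, 16))    # one group: (C, rows, W)

# Flatten the second and third dimensions: (128, 4, 16) -> (128, 64),
# then put positions first so each row is one 128-dimensional vector.
a = group.reshape(128, -1).T                 # A: (64, 128)

w1 = rng.standard_normal((128, 384)) * 0.02  # random stand-in for learnable W1
b = a @ w1                                   # B: (64, 384)
print(a.shape, b.shape)  # (64, 128) (64, 384)
```

Row `r * 16 + w` of A holds the 128 channel values at row `r`, column `w` of the group, so each of the 64 spatial positions becomes one vector.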

[0057] Next, the segmented grouping self-attention function divides the first-order transformation segmented feature B into three parts: a first transform segmentation feature (C), a second transform segmentation feature (D), and a third transform segmentation feature (E). Based on these three features, the function then obtains the segmented grouping self-attention map of the segmented grouping self-attention feature vector.

[0058] Specifically, the first, second, and third transform segmentation features are each divided into the same number of groups of feature vectors. For the i-th group (where i is a positive integer), the i-th group of the first transform segmentation feature is matrix-multiplied with the transpose of the i-th group of the second transform segmentation feature to obtain a product matrix. A softmax operation is applied to the product matrix, and the result is multiplied by the i-th group of the third transform segmentation feature to obtain the segmented grouping self-attention map of the i-th group. The maps of all groups are then stacked to obtain the segmented grouping self-attention map of the segmented grouping self-attention feature vector. For example, the first-order transformation segmented feature B (64, 384) is divided into three parts: the first transform segmentation feature C (64, 128), the second transform segmentation feature D (64, 128), and the third transform segmentation feature E (64, 128). C, D, and E are each further divided into 8 groups of feature vectors, [c1, c2, ..., c8], [d1, d2, ..., d8], and [e1, e2, ..., e8], each group with dimensions (64, 16). For the first group, c1 is matrix-multiplied with the transpose of d1 to obtain the product matrix N (64, 64). A softmax operation is then performed on N, and the result is multiplied by e1 to obtain the segmented grouping self-attention map f1 (64, 16) of the first group. The formula is:

[0059] f1 = softmax(c1 * d1^T) * e1

[0060] Similarly, the same operation is performed on [c2,d2,e2], [c3,d3,e3], ..., [c8,d8,e8] as on [c1,d1,e1] to obtain f2, f3, ..., f8. Finally, [f1,f2,f3, ...,f8] are stacked (that is, the segmented group self-attention maps corresponding to these 8 sets of feature vectors are stacked) to obtain the final calculation result, which is the segmented group self-attention map of the segmented group self-attention feature vector. The dimension of the segmented group self-attention map is (64, 128).
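
The per-group computation and the final stacking can be sketched as below. Two points are assumptions, since the text does not state them: the softmax is taken row by row, and "stacking" is implemented as concatenation along the feature axis, which is what reproduces the stated (64, 128) result. The function names are illustrative.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def grouped_self_attention(b: np.ndarray, num_groups: int = 8) -> np.ndarray:
    """Split B into C, D, E, attend within each group, then concatenate."""
    c, d, e = np.split(b, 3, axis=1)                 # each (64, 128)
    outputs = []
    for c_i, d_i, e_i in zip(np.split(c, num_groups, axis=1),
                             np.split(d, num_groups, axis=1),
                             np.split(e, num_groups, axis=1)):
        n = c_i @ d_i.T                              # N: (64, 64)
        outputs.append(softmax(n, axis=-1) @ e_i)    # f_i: (64, 16)
    return np.concatenate(outputs, axis=1)           # (64, 128)

rng = np.random.default_rng(0)
b = rng.standard_normal((64, 384))                   # stand-in for B = A * W1
attention_map = grouped_self_attention(b)
print(attention_map.shape)  # (64, 128)
```

Changing `num_groups` to 4 or 16 corresponds to the alternative groupings of C, D, and E that the text mentions.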

[0061] Finally, using the segmented grouping self-attention function, the segmented grouping self-attention map corresponding to the target feature map is determined based on the segmented grouping self-attention maps corresponding to each of the several groups of segmented grouping self-attention feature vectors. For example, the segmented grouping self-attention maps of all groups of segmented grouping self-attention feature vectors of the target feature map can be stacked to obtain the segmented grouping self-attention map corresponding to the target feature map. In this way, the segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map can be obtained respectively. [The three formulas giving these maps are missing from the source text.] In those formulas, W1, W2, and W3 are all first learnable parameter matrices of dimension (128, 384); 4, 2, and 1 denote the number of cells per group (4, 2, and 1, respectively); and 8 denotes that the segmented grouping self-attention map is calculated for each of 8 groups of feature vectors. It should be noted that the parameters of the SSA function are variable; for example, C, D, and E can be divided into 8, 4, or 16 groups.
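
Putting the steps together, a hedged end-to-end sketch of the SSA function for one target feature map might look like the following. It is run only on M1, whose 128 channels match the stated (128, 384) matrix shape. The function name `ssa`, the row-wise softmax, and stacking the per-region maps along a new leading axis are assumptions. Reading 4, 2, and 1 as cells per group for M1, M2, and M3 is an interpretation of the text; under it, each of the three maps yields 12 regions.

```python
import numpy as np

def ssa(feature_map, w, rows_per_group, num_vector_groups=8):
    """Sketch of the SSA function for one (C, H, W) target feature map."""
    channels, height, _ = feature_map.shape
    region_maps = []
    for start in range(0, height, rows_per_group):
        region = feature_map[:, start:start + rows_per_group, :]
        a = region.reshape(channels, -1).T            # A: (positions, C)
        c, d, e = np.split(a @ w, 3, axis=1)          # B split into C, D, E
        parts = []
        for c_i, d_i, e_i in zip(np.split(c, num_vector_groups, axis=1),
                                 np.split(d, num_vector_groups, axis=1),
                                 np.split(e, num_vector_groups, axis=1)):
            n = c_i @ d_i.T
            weights = np.exp(n - n.max(axis=1, keepdims=True))
            weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
            parts.append(weights @ e_i)
        region_maps.append(np.concatenate(parts, axis=1))
    return np.stack(region_maps)                      # (regions, positions, C_out)

# Cells per group 4, 2, 1 give the same number of regions for M1, M2, M3.
region_counts = [h // g for h, g in ((48, 4), (24, 2), (12, 1))]

rng = np.random.default_rng(0)
m1 = rng.standard_normal((128, 48, 16))               # stand-in for M1
w1 = rng.standard_normal((128, 384)) * 0.02           # stand-in for W1
s1 = ssa(m1, w1, rows_per_group=4)
print(region_counts, s1.shape)  # [12, 12, 12] (12, 64, 128)
```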

[0062] S202c: Use the segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map respectively as the segmented grouping self-attention information of the human image to be matched.

[0063] After obtaining the segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map respectively, the segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map respectively can be used as the segmented grouping self-attention information of the human image to be matched.

[0064] Next, we will introduce one implementation method of "fusing and encoding the segmented group self-attention information of the human image to be matched to obtain the human feature vector and pose value of the human image to be matched" in S202. That is, in this embodiment, the step of fusing and encoding the segmented group self-attention information of the human image to be matched to obtain the human feature vector and pose value of the human image to be matched may include the following steps:

[0065] S202A: Input the segmented grouped self-attention maps corresponding to the first feature map, the second feature map, and the third feature map respectively into the human feature module in the cross-layer bidirectional inter-coding network to obtain the human feature vector of the human image to be matched;

[0066] In this embodiment, the cross-layer bidirectional inter-encoding network includes a human feature module, which can input the segmented grouped self-attention maps corresponding to the first feature map, the second feature map, and the third feature map into the human feature module of the cross-layer bidirectional inter-encoding network to obtain the human feature vector of the human image to be matched.

[0067] Specifically, the human feature module can be used to obtain the first segmented cross-layer fusion encoding based on the segmented grouping self-attention maps corresponding to the first feature map and the second feature map, as well as the preset second learnable matrix.

[0068] As an example, step 1: The human feature module divides the segmented grouping self-attention map corresponding to the first feature map into several (e.g., 12) groups of self-attention grouped feature maps, each group being G1 with dimensions (128, 4, 16); divides the segmented grouping self-attention map corresponding to the second feature map into several (e.g., 12) groups of self-attention grouped feature maps, each group being G2 with dimensions (256, 2, 8); and divides the segmented grouping self-attention map corresponding to the third feature map into several (e.g., 12) groups of self-attention grouped feature maps, each group being G3 with dimensions (512, 1, 4). Step 2: The human feature module transforms the dimension of G1 to (64, 128) and then multiplies it by the second learnable matrix of dimension (128, 256) to obtain the first self-attention group feature map G′1, with dimensions (64, 256). That is, G′1 is the product of the reshaped G1 and the second learnable matrix.

[0069] Step 3: The human feature module can transform the dimension of G2 to (16, 256), then multiply G2 by the transpose of G′1, perform a softmax operation, and then multiply the result by G′1 again to obtain the second self-attention group feature map G′2. The specific formula is: $G'_2 = \mathrm{softmax}(G_2 \cdot {G'_1}^{T}) \cdot G'_1$.

[0070] Step 4: The human feature module processes each of the several (e.g., 12) groups of self-attention grouped feature maps G1, together with the corresponding one of the several (e.g., 12) groups of self-attention grouped feature maps G2, by steps 2 and 3, which yields several (e.g., 12) groups of G′2. Stacking these groups of G′2 yields the first segmented cross-layer fusion encoding R1 with dimensions (12, 16, 256).

[0071] It should be noted that in this implementation, steps 1 to 4 are defined as the "segmented cross-layer shallow-deep fusion coding function", which can be called the SCFC function. As shown above, its parameter is 12 (the number of segments, i.e., the number of self-attention grouped feature maps). The calculation above, applied to the segmented grouping self-attention maps corresponding to the first and second feature maps, completes the process of "segmented cross-layer shallow-deep fusion coding", which is cross-layer coding from shallow to deep. That is, R1 is the output of the SCFC function with 12 segments applied to these two maps and the second learnable matrix.
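Steps 1 to 4 can be illustrated by the following non-limiting Python sketch using NumPy. It assumes that "transforming the dimension" of a (C, H, W) group means flattening the last two dimensions and transposing to (H·W, C), and that the i-th group of the first map is paired with the i-th group of the second map. The name `scfc`, the random inputs, and the matrix `w2` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scfc(shallow, deep, w):
    """Fuse shallow groups into deeper groups.

    shallow: (n, Ls, Cs), deep: (n, Ld, Cd), w: (Cs, Cd); returns (n, Ld, Cd).
    """
    fused = []
    for g_s, g_d in zip(shallow, deep):
        g_s_proj = g_s @ w                    # step 2: G'1, shape (Ls, Cd)
        attn = softmax(g_d @ g_s_proj.T)      # step 3: softmax(G2 G'1^T), shape (Ld, Ls)
        fused.append(attn @ g_s_proj)         # step 3: G'2, shape (Ld, Cd)
    return np.stack(fused)                    # step 4: stack the n groups

rng = np.random.default_rng(0)
g1 = rng.standard_normal((12, 128, 4, 16))    # 12 groups of G1 (128, 4, 16)
g2 = rng.standard_normal((12, 256, 2, 8))     # 12 groups of G2 (256, 2, 8)

# Assumed reshape: (C, H, W) -> (H*W, C) for every group.
g1_flat = g1.reshape(12, 128, 64).transpose(0, 2, 1)   # (12, 64, 128)
g2_flat = g2.reshape(12, 256, 16).transpose(0, 2, 1)   # (12, 16, 256)

w2 = rng.standard_normal((128, 256)) * 0.05   # hypothetical second learnable matrix
r1 = scfc(g1_flat, g2_flat, w2)
print(r1.shape)
```

The resulting shape of `r1` agrees with the dimensions (12, 16, 256) stated for R1.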

[0072] Then, the human feature module can be used to obtain the second segmented cross-layer fusion encoding R2 based on the first segmented cross-layer fusion encoding R1, the segmented grouping self-attention map corresponding to the third feature map, and the preset third learnable matrix.

[0073] As an example, in step 5, the human feature module multiplies each of the several (e.g., 12) groups of second self-attention grouped feature maps G′2 in the first segmented cross-layer fusion encoding R1 by a third learnable matrix of dimension (256, 512) to obtain the third self-attention group feature map G″2, with dimensions (16, 512). That is, G″2 is the product of G′2 and the third learnable matrix.

[0074] Step 6: The human feature module transforms the dimension of G3 to (4, 512), multiplies G3 by the transpose of G″2, performs a softmax operation, and then multiplies the result by G″2 again to obtain the fourth self-attention group feature map G′3, with the formula: $G'_3 = \mathrm{softmax}(G_3 \cdot {G''_2}^{T}) \cdot G''_2$.

[0075] Step 7: The human feature module applies the operation described in step 6 to each of the several (e.g., 12) groups of self-attention grouped feature maps G3, resulting in several (e.g., 12) groups of G′3. Stacking these groups of G′3 yields the second segmented cross-layer fusion encoding R2, whose dimensions can be (12, 4, 512). The above process can be understood as completing the "segmented cross-layer shallow-deep fusion encoding" based on the fused R1 and the segmented grouping self-attention map corresponding to the third feature map, and can be expressed by the following formula:

[0076]

[0077] Finally, the human feature module can be used to determine the human feature vector of the human image to be matched based on the second segmented cross-layer fusion encoding. For example, the second segmented cross-layer fusion encoding R2 can be averaged along the second dimension to obtain several (e.g., 12) human feature vectors with dimensions (12, 512). It should be noted that since the human feature vector is mainly composed of high-dimensional information and supplemented by shallow-dimensional information, the human feature vector is obtained from R2 through a shallow-deep encoding method.
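As a non-limiting illustration, steps 5 to 7 and the final averaging can be sketched with NumPy batched matrix products. The arrays `r1` and `g3` below are random stand-ins for R1 and for the 12 groups of G3 already reshaped to (4, 512), and `w3` is a hypothetical third learnable matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
r1 = rng.standard_normal((12, 16, 256))       # stand-in for R1
g3 = rng.standard_normal((12, 4, 512))        # stand-in for 12 groups of reshaped G3
w3 = rng.standard_normal((256, 512)) * 0.05   # hypothetical third learnable matrix

g2_proj = r1 @ w3                                  # step 5: G''2, shape (12, 16, 512)
attn = softmax(g3 @ g2_proj.transpose(0, 2, 1))    # step 6: shape (12, 4, 16)
r2 = attn @ g2_proj                                # steps 6 and 7: R2, shape (12, 4, 512)

# Average along the second dimension to get 12 human feature vectors.
human_features = r2.mean(axis=1)                   # shape (12, 512)
print(r2.shape, human_features.shape)
```

Each row of `attn` sums to 1, so every G′3 row is a weighted average of rows of G″2; the deep map selects which fused shallow rows contribute to the feature vector.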

[0078] S202B: Input the segmented grouped self-attention maps corresponding to the first feature map, the second feature map, and the third feature map respectively into the pose value module in the cross-layer bidirectional inter-coding network to obtain the pose value of the human image to be matched.

[0079] In this embodiment, the cross-layer bidirectional inter-coding network includes a pose value module. The segmented grouped self-attention maps corresponding to the first feature map, the second feature map, and the third feature map can be input into the pose value module of the cross-layer bidirectional inter-coding network to obtain the pose value of the human image to be matched. Compared with the human feature module, this step changes the fusion direction and the dimensions of the learnable matrices so that fusion encoding proceeds from deep to shallow. This step can be understood as performing "segmented cross-layer deep-shallow fusion encoding".

[0080] As an example, the pose value module can be used to obtain the first self-attention group feature map based on the segmented group self-attention maps corresponding to the second and third feature maps, respectively, and a preset fourth learnable matrix. Specifically, this step can be used to obtain the first self-attention group feature map T1 using the following formula:

[0081]

[0082] Here, the fourth learnable matrix has dimensions (512, 256); 12 represents the number of segments of the segmented grouping self-attention maps corresponding to the second and third feature maps, i.e., the number of self-attention grouped feature maps.

[0083] Then, using this pose value module, a second self-attention group feature map can be obtained based on the first self-attention group feature map, the segmented group self-attention map corresponding to the first feature map, and the preset fifth learnable matrix. This step can be understood as performing "segmented cross-layer deep-shallow fusion encoding," and specifically, the second self-attention group feature map T2 can be obtained using the following formula:

[0084]

[0085] Here, the fifth learnable matrix has dimensions (256, 128); 12 represents the number of segments in the segmented grouping self-attention map corresponding to the first feature map, i.e., the number of self-attention grouped feature maps. The dimensions of T2 can be (12, 64, 128).
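The formulas for T1 and T2 are not reproduced in this text. The following non-limiting Python sketch therefore rests on an assumption: that the deep-shallow encoding mirrors the shallow-deep steps with the roles swapped, so that the deeper map is projected by the learnable matrix and the shallower map attends to it. Under this assumption the shapes agree with the stated matrix dimensions (512, 256) and (256, 128) and with the stated dimensions (12, 64, 128) of T2. The function name `fuse`, the inputs, and the matrices `w4` and `w5` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(source, target, w):
    """Project `source` with w, then let `target` attend to the projection.

    source: (n, Ls, Cs), target: (n, Lt, Ct), w: (Cs, Ct); returns (n, Lt, Ct).
    """
    proj = source @ w                                   # (n, Ls, Ct)
    attn = softmax(target @ proj.transpose(0, 2, 1))    # (n, Lt, Ls)
    return attn @ proj                                  # (n, Lt, Ct)

rng = np.random.default_rng(2)
s1 = rng.standard_normal((12, 64, 128))   # groups of the first map, reshaped
s2 = rng.standard_normal((12, 16, 256))   # groups of the second map, reshaped
s3 = rng.standard_normal((12, 4, 512))    # groups of the third map, reshaped
w4 = rng.standard_normal((512, 256)) * 0.05   # hypothetical fourth learnable matrix
w5 = rng.standard_normal((256, 128)) * 0.05   # hypothetical fifth learnable matrix

t1 = fuse(s3, s2, w4)    # assumed form of T1, shape (12, 16, 256)
t2 = fuse(t1, s1, w5)    # assumed form of T2, shape (12, 64, 128)
print(t1.shape, t2.shape)
```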

[0086] Using this pose value module, the pose feature vector of the human image to be matched is determined based on the second self-attention group feature map. For example, using this pose value module, the second self-attention group feature map is average pooled along the second dimension to obtain several (e.g., 12) pose feature vectors with dimensions (12, 128).

[0087] Using this pose value module, the pose value of the human image to be matched is determined based on the pose feature vector of the human image to be matched and the preset sixth learnable matrix parameters. For example, the pose feature vector of the human image to be matched can be multiplied by the sixth learnable matrix parameters with a dimension of (128, 1), and then the product can be subjected to a sigmoid operation to obtain the pose values of the 12 pose feature vectors after normalization with a dimension of (12, 1).
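The average pooling and sigmoid steps can be illustrated by the following non-limiting Python sketch. The array `t2` is a random stand-in for the second self-attention group feature map and `w6` is a hypothetical sixth learnable matrix.

```python
import numpy as np

def sigmoid(x):
    # Logistic function, mapping any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
t2 = rng.standard_normal((12, 64, 128))     # stand-in for T2
w6 = rng.standard_normal((128, 1)) * 0.1    # hypothetical sixth learnable matrix

pose_features = t2.mean(axis=1)             # average pool along dim 2: (12, 128)
pose_values = sigmoid(pose_features @ w6)   # normalized pose values: (12, 1)
print(pose_features.shape, pose_values.shape)
```

Because of the sigmoid, each of the 12 pose values lies strictly between 0 and 1, so the absolute difference between two pose values is always less than 1.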

[0088] It should be noted that human pose is mainly composed of shallow-dimensional information, supplemented by high-dimensional information. Therefore, the pose value is obtained from T2 through deep-shallow encoding.

[0089] Next, we will introduce one implementation method of S203 "determining the similarity between the two human images to be matched based on the human feature vectors and pose values corresponding to each of the two human images to be matched respectively". That is, in this embodiment, the step of determining the similarity between the two human images to be matched based on the human feature vectors and pose values corresponding to each of the two human images to be matched respectively may include the following steps:

[0090] S203a: Normalize and perform dot product processing on the human feature vectors corresponding to the two human images to be matched to obtain the dot product value.

[0091] As an example, the dot product of the human feature vectors corresponding to the two images to be matched is obtained after normalizing them. For example, the dot product can be determined using the following formula:

[0092]

[0093] Where dot is the dot product value; norm() is the normalization function; and the two vector operands are the i-th human feature vector of one human image to be matched and the i-th human feature vector of the other human image to be matched.
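As a non-limiting illustration, the normalized dot product can be sketched as follows. The sketch assumes that norm() denotes L2 normalization, which this text does not specify, and that one dot product value is computed for each of the 12 pairs of feature vectors. The function names and the random inputs are hypothetical.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale each row to unit Euclidean length (assumed meaning of norm()).
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)

def normalized_dots(feat_a, feat_b):
    # One dot product per pair of i-th feature vectors; returns shape (12,).
    return np.sum(l2_normalize(feat_a) * l2_normalize(feat_b), axis=-1)

rng = np.random.default_rng(4)
feat_a = rng.standard_normal((12, 512))     # feature vectors of one image
feat_b = rng.standard_normal((12, 512))     # feature vectors of the other image

dots = normalized_dots(feat_a, feat_b)
self_dots = normalized_dots(feat_a, feat_a)   # identical inputs give values close to 1
print(dots.shape)
```

With L2 normalization each value is a cosine similarity and therefore lies in the range [-1, 1].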

[0094] S203b: Determine the weight value based on the absolute difference between the pose values corresponding to the two human body images to be matched.

[0095] In this embodiment, the dot product value of each corresponding feature needs to be weighted "by pose value". This weight value is obtained by "normalizing" the absolute difference between the two poses. For example, the weight value can be determined by the following formula:

[0096]

[0097] Where weight is the weight value; 12 means that both human images to be matched have 12 pose values; the operands indexed by i are the i-th pose value of one human image to be matched and the i-th pose value of the other human image to be matched; and the operands indexed by j are the j-th pose value of one human image to be matched and the j-th pose value of the other human image to be matched.
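The weight formula itself is not reproduced in this text. The following non-limiting Python sketch shows one literal reading of "normalizing the absolute difference": the i-th absolute pose difference divided by the sum of all 12 absolute pose differences. This form, the uniform fallback for identical poses, and the function name `pose_weights` are assumptions.

```python
import numpy as np

def pose_weights(pose_a, pose_b):
    # pose_a, pose_b: the 12 pose values of each image, each in (0, 1).
    diff = np.abs(np.ravel(pose_a) - np.ravel(pose_b))
    total = diff.sum()
    if total == 0.0:
        # Identical poses: fall back to uniform weights (an assumption,
        # since this case is not covered by the text).
        return np.full(diff.shape, 1.0 / diff.size)
    # Assumed form: weight_i = |a_i - b_i| / sum_j |a_j - b_j|.
    return diff / total

pose_a = np.linspace(0.1, 0.9, 12)    # hypothetical pose values of one image
pose_b = pose_a[::-1].copy()          # hypothetical pose values of the other image
weights = pose_weights(pose_a, pose_b)
print(weights.shape)
```

Under this reading the 12 weights are non-negative and sum to 1.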

[0098] S203c: Determine the similarity between the two human images to be matched based on the dot product value and the weight value.

[0099] In this embodiment, the product of the dot product and the weight value can be used as the similarity between two human images to be matched. For example, the similarity can be obtained using the following formula:

[0100]

[0101] Where similarity represents the similarity between the two human images to be matched; norm() is the normalization function; the two vector operands are the i-th human feature vector of one human image to be matched and the i-th human feature vector of the other human image to be matched; 12 represents that both human images to be matched have 12 pose values; the operands indexed by i are the i-th pose value of one human image to be matched and the i-th pose value of the other human image to be matched; and the operands indexed by j are the j-th pose value of one human image to be matched and the j-th pose value of the other human image to be matched.

[0102] It should be noted that the similarity formula (a) above is defined as the PSF function in this embodiment. Therefore, the similarity between the human image P1 and the human image P2 to be matched can be expressed as:

[0103] $\mathrm{similarity} = \mathrm{PSF}(R_{p1}, R_{p2}, T_{p1}, T_{p2})$

[0104] Among them, $R_{p1}$ represents the human feature vector of the human image P1 to be matched; $R_{p2}$ represents the human feature vector of the human image P2 to be matched; $T_{p1}$ represents the pose value of the human image P1 to be matched; and $T_{p2}$ represents the pose value of the human image P2 to be matched. If the similarity between the two images is greater than the comparison threshold, the two images show the same pedestrian; if it is less than the threshold, they do not.
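As a non-limiting illustration, the final combination and the threshold comparison can be sketched in plain Python. The sketch assumes that the 12 products of dot product value and weight value are summed into a single similarity, which this text does not state explicitly. The function names, the example values, and the threshold 0.8 are hypothetical.

```python
def psf_similarity(dots, weights):
    # Assumed aggregation: sum over i of dot_i * weight_i.
    return sum(d * w for d, w in zip(dots, weights))

def is_same_person(similarity, threshold):
    # Equality is treated as a match, as in the device embodiment.
    return similarity >= threshold

dots = [0.9] * 12             # hypothetical dot product values
weights = [1.0 / 12] * 12     # hypothetical weight values summing to 1
score = psf_similarity(dots, weights)
match = is_same_person(score, threshold=0.8)
print(match)
```

With weights that sum to 1, the similarity is a weighted average of the per-vector dot products, so it stays within the same range as the dot product values.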

[0105] All of the above-mentioned optional technical solutions can be combined in any way to form optional embodiments of this disclosure, and will not be described in detail here.

[0106] The following are embodiments of the apparatus disclosed herein, which can be used to execute embodiments of the method disclosed herein. For details not disclosed in the apparatus embodiments of this disclosure, please refer to the embodiments of the method disclosed herein.

[0107] Figure 3 is a schematic diagram of the human image matching device provided in an embodiment of this disclosure. As shown in Figure 3, the human image matching device includes:

[0108] Image acquisition unit 301 is used to acquire two human body images to be matched;

[0109] The information acquisition unit 302 is used to determine the segmented grouping self-attention information of each human image to be matched based on the human image to be matched; and to perform fusion encoding processing on the segmented grouping self-attention information of the human image to be matched to obtain the human feature vector and pose value of the human image to be matched.

[0110] The similarity determination unit 303 is used to determine the similarity between the two human images to be matched based on the human feature vectors and pose values corresponding to the two human images to be matched respectively.

[0111] The result determination unit 304 is used to determine the matching result of the two human images to be matched based on the similarity between the two images.
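The cooperation of units 301 to 304 can be illustrated by the following non-limiting Python sketch, in which each unit is represented by a callable. The class name `HumanImageMatcher`, its field names, and the toy stand-in functions are hypothetical and serve only to show the data flow between the units.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class HumanImageMatcher:
    acquire: Callable[[], Tuple[Any, Any]]               # image acquisition unit 301
    extract: Callable[[Any], Tuple[Any, Any]]            # information acquisition unit 302
    similarity: Callable[[Any, Any, Any, Any], float]    # similarity determination unit 303
    threshold: float                                     # used by result determination unit 304

    def match(self) -> bool:
        img_a, img_b = self.acquire()
        feat_a, pose_a = self.extract(img_a)   # human feature vector and pose value
        feat_b, pose_b = self.extract(img_b)
        score = self.similarity(feat_a, feat_b, pose_a, pose_b)
        return score >= self.threshold         # True means images of the same user

# Toy stand-ins, for illustration only.
matcher = HumanImageMatcher(
    acquire=lambda: ("image_a", "image_b"),
    extract=lambda image: ([1.0, 0.0], [0.5]),
    similarity=lambda fa, fb, pa, pb: sum(x * y for x, y in zip(fa, fb)),
    threshold=0.8,
)
same_user = matcher.match()
print(same_user)
```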

[0112] Optionally, the information acquisition unit 302 is used for:

[0113] The human image to be matched is input into the feature map extraction module in the cross-layer bidirectional inter-coding network to obtain the first feature map, the second feature map and the third feature map;

[0114] The first feature map, the second feature map, and the third feature map are respectively input into the segmented grouping self-attention function in the cross-layer bidirectional inter-coding network to obtain the segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, respectively.

[0115] The segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map are used as the segmented grouping self-attention information of the human image to be matched.

[0116] Optionally, the information acquisition unit 302 is configured to input the first feature map, the second feature map, and the third feature map into the segmented grouping self-attention function in the cross-layer bidirectional inter-coding network, respectively, to obtain segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, including:

[0117] The first feature map, the second feature map, and the third feature map are respectively used as target feature maps;

[0118] The target feature map is input into the segmented grouping self-attention function in this cross-layer bidirectional inter-encoding network;

[0119] Using this segmented group self-attention function, the target feature map is segmented and grouped along the second dimension to obtain several sets of segmented group self-attention feature vectors;

[0120] For each group of segmented self-attention feature vectors, the segmented self-attention function is used to segment the segmented self-attention feature vectors to obtain the segmented self-attention map of the segmented self-attention feature vectors.

[0121] Using this segmented group self-attention function, based on the segmented group self-attention maps corresponding to each of the several groups of segmented group self-attention feature vectors, the segmented group self-attention map corresponding to the target feature map is determined.

[0122] Optionally, the information acquisition unit 302 is used for:

[0123] For each segmented group self-attention feature vector, the second and third dimensions of the segmented group self-attention feature vector are flattened using the segmented group self-attention function to obtain the segmented features of the segmented group self-attention feature vector.

[0124] Using the segmented group self-attention function, the first-order transform segmentation feature of the segmented group self-attention feature vector is determined from the segmentation feature of the segmented group self-attention feature vector and the preset first learnable parameter matrix.

[0125] The segmented group self-attention function is used to divide the first-order transform segmentation feature into a first first-order transform segmentation feature, a second first-order transform segmentation feature, and a third first-order transform segmentation feature.

[0126] Using the segmented group self-attention function, the segmented group self-attention map of the segmented group self-attention feature vector is obtained based on the first first-order transform segmentation feature, the second first-order transform segmentation feature, and the third first-order transform segmentation feature.

[0127] Optionally, the information acquisition unit 302 is used for:

[0128] The segmented grouping self-attention maps corresponding to the first feature map, the second feature map, and the third feature map are respectively input into the human feature module of the cross-layer bidirectional inter-encoding network to obtain the human feature vector of the human image to be matched.

[0129] The segmented grouped self-attention maps corresponding to the first feature map, the second feature map, and the third feature map are input into the pose value module in the cross-layer bidirectional inter-encoding network to obtain the pose value of the human image to be matched.

[0130] Optionally, the information acquisition unit 302 is used for:

[0131] Using the human body feature module, based on the segmented group self-attention maps corresponding to the first feature map and the second feature map respectively, and the preset second learnable matrix, the first segmented cross-layer fusion code is obtained;

[0132] Using the human body feature module, a second segmented cross-layer fusion code is obtained based on the first segmented cross-layer fusion code, the segmented group self-attention map corresponding to the third feature map, and the preset third learnable matrix;

[0133] Using this human body feature module, the human body feature vector of the human body image to be matched is determined according to the second segmented cross-layer fusion encoding.

[0134] Optionally, the information acquisition unit 302 is used for:

[0135] Using the pose value module, the first self-attention group feature map is obtained based on the segmented group self-attention maps corresponding to the second feature map and the third feature map respectively, as well as the preset fourth learnable matrix.

[0136] Using the pose value module, a second self-attention group feature map is obtained based on the first self-attention group feature map, the segmented group self-attention map corresponding to the first feature map, and the preset fifth learnable matrix;

[0137] Using the pose value module, the pose feature vector of the human image to be matched is determined based on the second self-attention group feature map;

[0138] Using this pose value module, the pose value of the human image to be matched is determined based on the pose feature vector of the human image to be matched and the preset sixth learnable matrix parameters.

[0139] Optionally, the similarity determination unit 303 is used for:

[0140] Normalize and perform dot product processing on the human feature vectors corresponding to the two human images to be matched to obtain the dot product value;

[0141] The weight value is determined based on the absolute difference between the pose values corresponding to the two human images to be matched.

[0142] The similarity between the two human images to be matched is determined based on the dot product value and the weight value.

[0143] Optionally, the result determination unit 304 is used for:

[0144] If the similarity between the two human images to be matched is equal to or greater than the preset similarity threshold, then the matching result of the two human images to be matched is determined to be images of the same user.

[0145] If the similarity between the two human images to be matched is less than the preset similarity threshold, then the matching result of the two human images to be matched is determined to be images of different users.

[0146] The technical solution provided in this disclosure is a human image matching device, which includes: an image acquisition unit for acquiring two human images to be matched; an information acquisition unit for determining, for each human image to be matched, the segmented grouping self-attention information of the human image to be matched, and for performing fusion encoding processing on that information to obtain the human feature vector and pose value of the human image to be matched; a similarity determination unit for determining the similarity between the two human images to be matched based on the human feature vectors and pose values corresponding to the two human images to be matched respectively; and a result determination unit for determining the matching result of the two human images to be matched based on the similarity between them. Because this embodiment extracts the human feature vector and pose value of each human image to be matched and determines the similarity between the two images from both, the similarity determination can be used for human image matching under different poses. The similarity determination is therefore more flexible and accurate, which in turn makes the matching result determined from the similarity more accurate.

[0147] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this disclosure.

[0148] Figure 4 is a schematic diagram of the computer device 4 provided in an embodiment of this disclosure. As shown in Figure 4, the computer device 4 in this embodiment includes a processor 401, a memory 402, and a computer program 403 stored in the memory 402 and executable on the processor 401. When the processor 401 executes the computer program 403, it implements the steps in the various method embodiments described above. Alternatively, when the processor 401 executes the computer program 403, it implements the functions of each module / unit in the various device embodiments described above.

[0149] Exemplarily, computer program 403 may be divided into one or more modules / units, which are stored in memory 402 and executed by processor 401 to perform the present disclosure. The one or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of computer program 403 in computer device 4.

[0150] Computer device 4 can be a desktop computer, laptop, handheld computer, cloud server, or other similar computer device. Computer device 4 may include, but is not limited to, processor 401 and memory 402. Those skilled in the art will understand that Figure 4 is merely an example of computer device 4 and does not constitute a limitation on computer device 4. The computer device may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device may also include input / output devices, network access devices, buses, etc.

[0151] Processor 401 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0152] The memory 402 can be an internal storage unit of the computer device 4, such as a hard disk or RAM of the computer device 4. The memory 402 can also be an external storage device of the computer device 4, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the computer device 4. Furthermore, the memory 402 can include both internal and external storage units of the computer device 4. The memory 402 is used to store computer programs and other programs and data required by the computer device. The memory 402 can also be used to temporarily store data that has been output or will be output.

[0153] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this disclosure. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0154] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0155] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this disclosure.

[0156] In the embodiments provided in this disclosure, it should be understood that the disclosed apparatus / computer devices and methods can be implemented in other ways. For example, the apparatus / computer device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. Multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0157] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0158] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0159] If an integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program may include computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable medium may include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in a computer-readable medium may be appropriately added to or subtracted according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

[0160] The above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit it. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be included within the protection scope of this disclosure.

Claims

1. A method for matching human body images, characterized in that the method includes:
obtaining two human images to be matched;
for each human image to be matched, determining the segmented group self-attention information of the human image to be matched based on the human image to be matched, and fusing and encoding the segmented group self-attention information of the human image to be matched to obtain the human feature vector and pose value of the human image to be matched;
determining the similarity between the two human images to be matched based on their respective human feature vectors and pose values; and
determining the matching result of the two human images to be matched based on the similarity between them;
wherein the step of determining the segmented group self-attention information of the human image to be matched based on the human image to be matched includes:
inputting the human image to be matched into the feature map extraction module in the cross-layer bidirectional inter-coding network to obtain a first feature map, a second feature map, and a third feature map;
taking each of the first feature map, the second feature map, and the third feature map as a target feature map;
inputting the target feature map into the segmented group self-attention function in the cross-layer bidirectional inter-coding network;
using the segmented group self-attention function, segmenting and grouping the target feature map along the second dimension to obtain several groups of segmented group self-attention feature vectors;
for each group of segmented group self-attention feature vectors, using the segmented group self-attention function to segment the segmented group self-attention feature vectors to obtain the segmented group self-attention map of the segmented group self-attention feature vectors;
using the segmented group self-attention function, determining the segmented group self-attention map corresponding to the target feature map based on the segmented group self-attention maps corresponding to each of the several groups of segmented group self-attention feature vectors, so as to obtain the segmented group self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, respectively; and
taking the segmented group self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, respectively, as the segmented group self-attention information of the human image to be matched;
wherein, for each group of segmented group self-attention feature vectors, the step of using the segmented group self-attention function to segment the segmented group self-attention feature vectors to obtain the segmented group self-attention map of the segmented group self-attention feature vectors includes:
for each group of segmented group self-attention feature vectors, flattening the second and third dimensions of the segmented group self-attention feature vectors using the segmented group self-attention function to obtain the segmented feature of the segmented group self-attention feature vectors;
determining the first-order transform segmentation feature of the segmented group self-attention feature vectors using the segmented feature of the segmented group self-attention feature vectors and a preset first learnable parameter matrix;
dividing the first-order transform segmentation feature into a first first-order transform segmentation feature, a second first-order transform segmentation feature, and a third first-order transform segmentation feature using the segmented group self-attention function; and
using the segmented group self-attention function, obtaining the segmented group self-attention map of the segmented group self-attention feature vectors based on the first first-order transform segmentation feature, the second first-order transform segmentation feature, and the third first-order transform segmentation feature.
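
The per-group steps recited in claim 1 (group along the second dimension, flatten the second and third dimensions, apply a learnable linear transform, split the result into three parts, derive an attention map) can be illustrated with a minimal, dependency-free Python sketch. Several points are assumptions and are not stated in the claim: the feature map is laid out as [C][H][W], the three parts play the roles of query, key, and value, the attention is scaled dot-product attention with a softmax, and the groups are tiled back into a map of the original shape. The names `segmented_group_self_attention`, `w_qkv`, and `num_groups` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def segmented_group_self_attention(fmap, w_qkv, num_groups):
    """fmap: [C][H][W] nested lists. w_qkv: [C][3*C], standing in for the
    'first learnable parameter matrix'. Returns a [C][H][W] map."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    if h % num_groups:
        raise ValueError("H must be divisible by num_groups")
    rows = h // num_groups
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for g in range(num_groups):
        r0 = g * rows
        # Flatten the 2nd and 3rd dimensions of this group:
        # one C-dimensional token per spatial position.
        tokens = [[fmap[ch][r][col] for ch in range(c)]
                  for r in range(r0, r0 + rows) for col in range(w)]
        # Linear transform, then split into three parts (assumed Q, K, V).
        qkv = matmul(tokens, w_qkv)
        q = [t[:c] for t in qkv]
        k = [t[c:2 * c] for t in qkv]
        v = [t[2 * c:] for t in qkv]
        # Assumed: scaled dot-product attention restricted to this group.
        scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(c) for kj in k]
                  for qi in q]
        attended = matmul(softmax_rows(scores), v)
        # Write the group's result back so the groups tile the full map.
        for i, tok in enumerate(attended):
            r, col = r0 + i // w, i % w
            for ch in range(c):
                out[ch][r][col] = tok[ch]
    return out
```

Because attention is computed inside each group only, a change to one horizontal band of the image cannot affect the output of another band. Under the layout assumed here, that locality is what grouping along the second dimension provides.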

2. The method according to claim 1, characterized in that the step of fusing and encoding the segmented group self-attention information of the human image to be matched to obtain the human feature vector and pose value of the human image to be matched includes:
inputting the segmented group self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, respectively, into the human feature module in the cross-layer bidirectional inter-coding network to obtain the human feature vector of the human image to be matched; and
inputting the segmented group self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, respectively, into the pose value module in the cross-layer bidirectional inter-coding network to obtain the pose value of the human image to be matched.

3. The method according to claim 2, characterized in that the step of inputting the segmented group self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, respectively, into the human feature module in the cross-layer bidirectional inter-coding network to obtain the human feature vector of the human image to be matched includes:
using the human feature module, obtaining a first segmented cross-layer fusion encoding based on the segmented group self-attention maps corresponding to the first feature map and the second feature map, respectively, and a preset second learnable matrix;
using the human feature module, obtaining a second segmented cross-layer fusion encoding based on the first segmented cross-layer fusion encoding, the segmented group self-attention map corresponding to the third feature map, and a preset third learnable matrix; and
using the human feature module, determining the human feature vector of the human image to be matched according to the second segmented cross-layer fusion encoding.

4. The method according to claim 2, characterized in that the step of inputting the segmented group self-attention maps corresponding to the first feature map, the second feature map, and the third feature map, respectively, into the pose value module in the cross-layer bidirectional inter-coding network to obtain the pose value of the human image to be matched includes:
using the pose value module, obtaining a first self-attention group feature map based on the segmented group self-attention maps corresponding to the second feature map and the third feature map, respectively, and a preset fourth learnable matrix;
using the pose value module, obtaining a second self-attention group feature map based on the first self-attention group feature map, the segmented group self-attention map corresponding to the first feature map, and a preset fifth learnable matrix;
using the pose value module, determining the pose feature vector of the human image to be matched based on the second self-attention group feature map; and
using the pose value module, determining the pose value of the human image to be matched based on the pose feature vector of the human image to be matched and preset sixth learnable matrix parameters.
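
Claims 3 and 4 fuse the same three attention maps in opposite orders: the human feature branch combines the first and second maps and then adds the third, while the pose branch combines the second and third maps and then adds the first. The Python sketch below shows only that data flow. The fusion operator is not specified in the claims, so the sketch assumes each map has already been pooled to a vector of length d and that fusion is a linear map applied to the concatenation of its two inputs. The helper `fuse`, the function names, and the matrix shapes are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def fuse(a, b, w):
    """Hypothetical fusion: linear map w (d x 2d) applied to [a; b]."""
    return matvec(w, a + b)

def human_feature_vector(m1, m2, m3, w2, w3):
    """Claim 3 order: (map 1, map 2) first, then map 3."""
    first_encoding = fuse(m1, m2, w2)               # second learnable matrix
    second_encoding = fuse(first_encoding, m3, w3)  # third learnable matrix
    # Assumed: the feature vector is the second encoding itself.
    return second_encoding

def pose_value(m1, m2, m3, w4, w5, w6):
    """Claim 4 order: (map 2, map 3) first, then map 1."""
    first_map = fuse(m2, m3, w4)         # fourth learnable matrix
    second_map = fuse(first_map, m1, w5)  # fifth learnable matrix
    pose_vector = second_map              # assumed: no extra pooling step
    # Assumed: the sixth learnable parameters project to a single scalar.
    return sum(a * b for a, b in zip(w6, pose_vector))
```

Running the two branches in opposite directions over the same layers appears to be what the term "bidirectional" refers to; the claims themselves do not explain the term.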

5. The method according to claim 1, characterized in that the step of determining the similarity between the two human images to be matched based on their respective human feature vectors and pose values includes:
normalizing the human feature vectors corresponding to the two human images to be matched and computing their dot product to obtain a dot product value;
determining a weight value based on the absolute difference between the pose values corresponding to the two human images to be matched; and
determining the similarity between the two human images to be matched based on the dot product value and the weight value.
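
A minimal Python sketch of the similarity computation in claim 5 follows. The claim fixes three ingredients (normalized dot product, a weight derived from the absolute pose difference, and a combination of the two) but not the form of the weight or how it is combined with the dot product. The default `weight_fn` below, which shrinks as the pose difference grows, and the use of multiplication are assumptions chosen for illustration; the function names are hypothetical.

```python
import math
from typing import Callable, Sequence

def l2_normalize(v: Sequence[float]) -> list:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def pose_weighted_similarity(
    feat_a: Sequence[float], pose_a: float,
    feat_b: Sequence[float], pose_b: float,
    weight_fn: Callable[[float], float] = lambda d: 1.0 / (1.0 + d),  # assumed form
) -> float:
    a, b = l2_normalize(feat_a), l2_normalize(feat_b)
    # Dot product of unit vectors, i.e. cosine similarity.
    dot_value = sum(x * y for x, y in zip(a, b))
    # Weight determined from the absolute pose difference.
    weight = weight_fn(abs(pose_a - pose_b))
    # Assumed combination: multiply the two values.
    return dot_value * weight
```

Passing a different `weight_fn` changes how strongly a pose mismatch affects the score without touching the feature comparison, which keeps the two sources of evidence separable.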

6. The method according to any one of claims 1-5, characterized in that the step of determining the matching result of the two human images to be matched based on the similarity between them includes:
if the similarity between the two human images to be matched is equal to or greater than a preset similarity threshold, determining that the matching result is that the two human images to be matched are images of the same user; and
if the similarity between the two human images to be matched is less than the preset similarity threshold, determining that the matching result is that the two human images to be matched are images of different users.
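
The decision rule in claim 6 reduces to a single comparison, sketched below in Python. The point worth making explicit is the boundary case: a similarity exactly equal to the threshold counts as a match. The function name and the returned labels are hypothetical, and the claim does not state a threshold value.

```python
def match_result(similarity: float, threshold: float) -> str:
    """Classify a pair of images; equality with the threshold is a match."""
    if similarity >= threshold:
        return "same user"
    return "different users"
```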

7. A human image matching device for implementing the method described in any one of claims 1-6, characterized in that the device includes:
an image acquisition unit, used to acquire two human images to be matched;
an information acquisition unit, used to determine, for each human image to be matched, the segmented group self-attention information of the human image to be matched based on the human image to be matched, and to perform fusion encoding processing on the segmented group self-attention information of the human image to be matched to obtain the human feature vector and pose value of the human image to be matched;
a similarity determination unit, used to determine the similarity between the two human images to be matched based on the human feature vectors and pose values corresponding to the two human images to be matched, respectively; and
a result determination unit, used to determine the matching result of the two human images to be matched based on the similarity between the two human images to be matched.

8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 6.