A video pedestrian shoe type retrieval method and system fusing local and global information
By transforming shoe images to the same perspective and collecting key points, visual signaling is constructed and similarity is calculated, solving the problem of low efficiency in manual comparison and achieving efficient and accurate retrieval of suspect shoe types in videos.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
- Filing Date: 2023-03-31
- Publication Date: 2026-06-16
AI Technical Summary
In existing technologies, shoe models are compared manually, which is inefficient, consumes a lot of manpower and time, and makes it difficult to efficiently retrieve the shoe model information of suspects in videos.
By acquiring shoe images to be matched and shoe images in a preset image library, spatial transformation parameters are used to transform them to the same viewpoint, pedestrian key points are collected, visual signals are constructed and shoe features are generated, and similarity is calculated to rank shoe images in the image library.
The method improves the efficiency and accuracy of shoe image comparison and retrieval while reducing manual intervention.
Figure CN117556071B_ABST
Description
Technical Field
[0001] This invention relates to the field of criminal investigation technology, and in particular to a method and system for retrieving the shoe type worn by pedestrians in video footage that integrates local and global information.
Background Technology
[0002] In the current field of criminal investigation, determining the type of shoes worn by a suspect based on on-site shoe prints, and then retrieving the suspect's image from video surveillance based on the shoe image, has become an important task for public security investigation agencies.
[0003] However, existing technologies typically rely on manual comparison to determine whether shoe models match, and manual screening is inefficient and consumes a great deal of manpower and time.
Summary of the Invention
[0004] In view of this, embodiments of the present invention provide a method for retrieving the shoe type worn by pedestrians in video by fusing local and global information, so as to eliminate or improve one or more defects existing in the prior art.
[0005] One aspect of the present invention provides a method for retrieving the shoe type worn by pedestrians in video footage by fusing local and global information, the method comprising the following steps:
[0006] Obtain the shoe image to be matched and the shoe images in the preset image library, and transform the shoe image into a shoe image from the same viewpoint based on the spatial transformation parameters;
[0007] The shoe image, transformed to the same perspective, is input into a preset pedestrian key point collector, which outputs a set of pedestrian key point coordinates for each shoe image.
[0008] Based on the coordinate set of the pedestrian key points corresponding to the shoe image, pixel blocks are extracted from the shoe image, and a first visual signaling corresponding to each pedestrian key point is constructed based on the extracted pixel blocks. Multiple first visual signals are input into a preset visual signaling set generator to obtain shoe features corresponding to each shoe image.
[0009] Calculate the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the image library, and sort the shoe images in the image library based on the similarity.
[0010] Using the above scheme, the shoe images are transformed into shoe images from the same perspective, facilitating comparison between images. Key points are then collected for each shoe image, and pixel blocks are extracted based on these key points to construct a first visual signal. Based on these key points, the most representative region of each shoe image can be collected, improving comparison efficiency. Shoe features are then obtained, which integrate multiple first visual signals. Similarity calculations are performed on these shoe features to directly obtain images that are similar to the shoe image to be matched. This scheme improves comparison efficiency while further enhancing the accuracy of similarity calculation through shoe image transformation, key point collection, construction of first visual signals and shoe features, and subsequent similarity calculations.
[0011] In some embodiments of the present invention, the step of extracting pixel blocks from the shoe image based on the pedestrian key point coordinate set corresponding to the shoe image, and constructing a first visual signaling corresponding to each pedestrian key point based on the extracted pixel blocks, includes:
[0012] A predetermined number of pixel blocks are extracted from the shoe image that are closest to each of the pedestrian key points;
[0013] The pixel blocks extracted based on each pedestrian keypoint are combined into a first visual signal.
[0014] In some embodiments of the present invention, in the step of inputting a plurality of first visual signals into a preset visual signaling set generator to obtain shoe features corresponding to each shoe image, a preset feature observation signal is obtained, the feature observation signal is combined with a plurality of first visual signals to form a visual signaling set, and the visual signaling set is input into the visual signaling set generator to obtain shoe features corresponding to each shoe image.
[0015] In some embodiments of the present invention, the visual transformation encoder includes two layer normalization modules, one multi-head attention module, one multilayer perceptron module, and two addition modules. After the visual signaling set is input into the visual transformation encoder, it first undergoes normalization processing through the first layer normalization module; then, attention features are extracted through the multi-head attention module; then, the first addition module adds the attention features to the original visual signaling set; then, the result undergoes normalization processing through the second layer normalization module; then, it is processed by the multilayer perceptron module; finally, the second addition module performs addition to obtain the updated feature observation signaling and first visual signaling.
[0016] In some embodiments of the present invention, the visual signaling set generator includes a plurality of visual transformation encoders connected in sequence. In the step of inputting the visual signaling set into the visual signaling set generator to obtain shoe features corresponding to each shoe image, the feature observation signaling and the first visual signaling in the visual signaling set are updated and output after passing through each visual transformation encoder. The updated feature observation signaling output by the last visual transformation encoder in the visual signaling set generator is used as the shoe feature.
[0017] In some embodiments of the present invention, in the step of calculating the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the image library, the shoe features corresponding to each shoe image in the image library are constructed into a temporary database, and the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the temporary database is calculated.
[0018] In some embodiments of the present invention, in the step of calculating the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the temporary database, the similarity is calculated according to the following formula:
[0019] $$\mathrm{sim}(F_Q, F_i) = \frac{(F_Q)^{T} F_i}{\|F_Q\|\,\|F_i\|}$$
[0020] Here, $F_Q$ represents the matrix of the shoe features corresponding to the shoe image to be matched; $F_i$ represents the matrix of the i-th shoe features in the temporary database; $\mathrm{sim}(F_Q, F_i)$ represents the similarity between the matrix of the shoe features corresponding to the shoe image to be matched and the matrix of the i-th shoe features in the temporary database; $(F_Q)^{T}$ represents the transpose of the matrix of the shoe features corresponding to the shoe image to be matched; and $\|\cdot\|$ represents the norm of a matrix.
[0021] In some embodiments of the present invention, in the step of transforming the shoe image into a shoe image from the same viewpoint based on spatial transformation parameters, the shoe image is transformed into a shoe image from the same viewpoint according to the following formula:
[0022] $$\begin{pmatrix} x_i' \\ y_i' \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix} \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix}$$
[0023] Here, $x_i$ represents the abscissa of any pixel block in the i-th shoe image in the image library before transformation; $x_i'$ represents the abscissa of that pixel block after transformation; $y_i$ represents the ordinate of that pixel block before transformation; $y_i'$ represents the ordinate of that pixel block after transformation; and $\theta_{11}$, $\theta_{12}$, $\theta_{13}$, $\theta_{21}$, $\theta_{22}$, and $\theta_{23}$ are all preset spatial transformation parameters.
[0024] In some embodiments of the present invention, the step of sorting shoe images in the image library based on similarity further includes obtaining a preset number of shoe images in the image library that have a high similarity to the shoe image to be matched.
[0025] A second aspect of the present invention also provides a video pedestrian shoe type retrieval system that integrates local and global information, the system comprising:
[0026] The shoe image transformation module acquires the shoe image to be matched and the shoe images in the preset image library, and transforms the shoe image into a shoe image from the same viewpoint based on spatial transformation parameters;
[0027] The key point acquisition module inputs the shoe image transformed to the same viewpoint into a preset pedestrian key point acquisition device, and the pedestrian key point acquisition device outputs a set of pedestrian key point coordinates for each shoe image.
[0028] The visual signaling set generation module extracts pixel blocks from the shoe image based on the coordinate set of the pedestrian key points corresponding to the shoe image, and constructs a first visual signaling corresponding to each pedestrian key point based on the extracted pixel blocks. Multiple first visual signals are input into a preset visual signaling set generator to obtain shoe features corresponding to each shoe image.
[0029] The similarity calculation module calculates the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the image library, and sorts the shoe images in the image library based on the similarity.
[0030] A third aspect of the present invention also provides a video pedestrian shoe type retrieval device that integrates local and global information. The device includes a computer device, which includes a processor and a memory. The memory stores computer instructions, and the processor executes the computer instructions stored in the memory. When the computer instructions are executed by the processor, the device performs the steps of the method described above.
[0031] A fourth aspect of the present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the aforementioned video pedestrian shoe type retrieval method that integrates local and global information.
[0032] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the text, or may be learned by practice of the invention. The objects and other advantages of the invention will become apparent from the description and the accompanying drawings.
[0033] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.
Description of the Drawings
[0034] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.
[0035] Figure 1 is a schematic diagram of one embodiment of the video pedestrian shoe type retrieval method that integrates local and global information according to the present invention;
[0036] Figure 2 is a schematic diagram of another embodiment of the video pedestrian shoe type retrieval method that integrates local and global information according to the present invention;
[0037] Figure 3 is a schematic diagram of a further embodiment of the video pedestrian shoe type retrieval method that integrates local and global information according to the present invention;
[0038] Figure 4 is a schematic diagram of one embodiment of the video pedestrian shoe type retrieval system that integrates local and global information according to the present invention.
Detailed Description
[0039] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.
[0040] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.
[0041] It should be emphasized that the term "comprising/including" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
[0042] It should also be noted that, unless otherwise specified, the term "connection" as used herein can refer not only to a direct connection, but also to an indirect connection involving an intermediary.
[0043] In the following description, embodiments of the invention will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.
[0044] To solve the above problems, as shown in Figures 1 and 2, this invention proposes a method for retrieving the shoe type worn by pedestrians in video footage by fusing local and global information. The method includes the following steps:
[0045] Step S100: Obtain the shoe image to be matched and the shoe image in the preset image library, and transform the shoe image into a shoe image from the same viewpoint based on the spatial transformation parameters;
[0046] As shown in Figure 2, in some embodiments of the present invention, in the step of transforming the shoe image into a shoe image from the same viewpoint based on spatial transformation parameters, the shoe image to be matched and the shoe images in the preset image library are input into a preset spatial transformation parameter generator to obtain the spatial transformation parameters corresponding to the shoe image to be matched and to each shoe image in the image library.
[0047] In the specific implementation, the spatial transformation parameter generator is a pre-trained convolutional neural network model, which includes at least one convolutional layer and one fully connected layer. The size and number of convolutional kernels can be flexibly configured. The network input is a shoe image, and the network output is a 6-dimensional spatial transformation parameter.
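As an illustration of the generator structure described above (one convolutional layer followed by one fully connected layer with a 6-dimensional output), a minimal sketch is given below. It is not the trained model of the embodiment: the grayscale list-of-rows image format, the single convolution kernel, the ReLU activation, and all function names are assumptions made for illustration, and the weights would in practice come from training.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a grayscale image with a single kernel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]


def relu(feature_map):
    """Element-wise ReLU (an assumed activation, not named in the text)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def fully_connected(vector, weights, bias):
    """One fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def predict_theta(image, kernel, fc_weights, fc_bias):
    """Map a shoe image to six parameters, arranged as a 2x3 affine matrix."""
    features = relu(conv2d(image, kernel))
    flat = [v for row in features for v in row]
    t = fully_connected(flat, fc_weights, fc_bias)
    return [t[:3], t[3:]]


# Hypothetical usage: with zero weights, the output equals the bias, here
# chosen as the identity transform (theta11 = theta22 = 1, all others 0).
image = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
theta = predict_theta(image, [[1.0, 0.0], [0.0, 1.0]],
                      [[0.0] * 4 for _ in range(6)],
                      [1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
```

The fully connected layer must have exactly six output rows so that the result can be reshaped into the 2x3 parameter matrix.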
[0048] In practice, the spatial transformation parameters can also be directly preset spatial transformation parameters.
[0049] In the specific implementation process, a spatial converter is used to perform an affine transformation on the input shoe image to obtain a shoe image under a unified pose. The spatial converter uses spatial transformation parameters to obtain the correspondence between the coordinates of the original shoe image and the transformed shoe image; and the transformed shoe image is generated from the original shoe image using a bilinear interpolation method.
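The affine transformation with bilinear interpolation performed by the spatial converter can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the image is grayscale and stored as a list of rows, coordinates are pixel indices, the six parameters are assumed to map coordinates before transformation to coordinates after transformation, and pixels that fall outside the original image are filled with zero. All function names are hypothetical.

```python
def invert_affine(theta):
    """Invert a 2x3 affine matrix [[a, b, c], [d, e, f]]."""
    (a, b, c), (d, e, f) = theta
    det = a * e - b * d
    if det == 0:
        raise ValueError("affine matrix is not invertible")
    ia, ib = e / det, -b / det
    id_, ie = -d / det, a / det
    return [[ia, ib, -(ia * c + ib * f)],
            [id_, ie, -(id_ * c + ie * f)]]


def bilinear_sample(image, x, y):
    """Sample the image at a real-valued position; 0.0 outside the image."""
    h, w = len(image), len(image[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
    bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def warp_affine(image, theta):
    """Build the transformed image by looking up, for every output pixel,
    the corresponding position in the original image."""
    h, w = len(image), len(image[0])
    inv = invert_affine(theta)
    out = []
    for yt in range(h):
        row = []
        for xt in range(w):
            xs = inv[0][0] * xt + inv[0][1] * yt + inv[0][2]
            ys = inv[1][0] * xt + inv[1][1] * yt + inv[1][2]
            row.append(bilinear_sample(image, xs, ys))
        out.append(row)
    return out
```

Iterating over output pixels and sampling the original image avoids holes in the transformed image, which is why the coordinate correspondence is evaluated in the inverse direction before interpolation.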
[0050] Step S200: The shoe image transformed to the same viewpoint is input into a preset pedestrian key point collector, and the pedestrian key point collector outputs a set of pedestrian key point coordinates for each shoe image.
[0051] In some embodiments of the present invention, the pedestrian key point collector is a lightweight convolutional neural network, such as the MobileNetV3 network. The model parameters of the pedestrian key point collector are learned by training on a preset training dataset, thereby obtaining a pedestrian key point collector with fixed parameters.
[0052] In the specific implementation process, the pedestrian key point collector outputs $p^2$ pedestrian key points, which constitute a set of pedestrian key point coordinates, i.e., $\{(x_i, y_i),\ i = 1, \dots, p^2\}$. The value of $p$ can be 8, 10, or 12.
[0053] Step S300: Based on the coordinate set of the pedestrian key points corresponding to the shoe image, pixel blocks are extracted from the shoe image, and a first visual signal corresponding to each pedestrian key point is constructed based on the extracted pixel blocks. Multiple first visual signals are input into a preset visual signal set generator to obtain shoe features corresponding to each shoe image.
[0054] In some embodiments of the present invention, pedestrian key points are selected one by one from the set of pedestrian key point coordinates, and $p^2$ image blocks, each of size 16*16, are cropped from the spatially transformed shoe image with the pedestrian key points as centers; the image blocks are projected one by one into 1*768 first visual signaling.
[0055] In the specific implementation process, a 1*768 position code is assigned to each first visual signaling and added to the corresponding first visual signaling.
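The cropping, projection, and position-code steps above can be sketched as follows. This is an illustrative sketch only: the image layout (rows of pixels, each pixel a list of channel values), zero padding at the image border, integer key point coordinates, and the function names are assumptions. With three color channels, a 16*16 block flattens to 16 * 16 * 3 = 768 values, which matches the 1*768 size given in the text; the projection weights and position codes would be learned in practice.

```python
def crop_block(image, cx, cy, size=16):
    """Crop a size x size block centred on (cx, cy) and flatten it.
    Positions outside the image are filled with zeros (an assumption)."""
    h, w = len(image), len(image[0])
    channels = len(image[0][0])
    half = size // 2
    flat = []
    for y in range(cy - half, cy - half + size):
        for x in range(cx - half, cx - half + size):
            if 0 <= y < h and 0 <= x < w:
                flat.extend(image[y][x])
            else:
                flat.extend([0.0] * channels)
    return flat  # length: size * size * channels


def project(vector, weights):
    """Linear projection; each weight row yields one output value."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def build_first_visual_signals(image, keypoints, weights, position_codes, size=16):
    """One first visual signaling per key point: projected block + position code."""
    signals = []
    for (cx, cy), code in zip(keypoints, position_codes):
        projected = project(crop_block(image, cx, cy, size), weights)
        signals.append([s + c for s, c in zip(projected, code)])
    return signals
```

The `size` parameter and the shape of `weights` are left open so that the same functions can be exercised on small toy inputs before being used with 16*16 blocks and 768-dimensional signaling.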
[0056] Step S400: Calculate the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the image library, and sort the shoe images in the image library based on the similarity.
[0057] As shown in Figure 2, in the specific implementation process, the shoe feature comparator compares the shoe features of the shoes worn by the suspect at the scene, i.e., the shoe features corresponding to the shoe image to be matched, one by one with the shoe features corresponding to each shoe image in the image library to obtain a set of retrieval results sorted by matching score.
[0058] Using the above scheme, the shoe images are transformed into shoe images from the same perspective, facilitating comparison between images. Key points are then collected for each shoe image, and pixel blocks are extracted based on these key points to construct a first visual signal. Based on these key points, the most representative region of each shoe image can be collected, improving comparison efficiency. Shoe features are then obtained, which integrate multiple first visual signals. Similarity calculations are performed on these shoe features to directly obtain images that are similar to the shoe image to be matched. This scheme improves comparison efficiency while further enhancing the accuracy of similarity calculation through shoe image transformation, key point collection, construction of first visual signals and shoe features, and subsequent similarity calculations.
[0059] As shown in Figure 3, in some embodiments of the present invention, the step of extracting pixel blocks from the shoe image based on the pedestrian key point coordinate set corresponding to the shoe image, and constructing a first visual signaling corresponding to each pedestrian key point based on the extracted pixel blocks, includes:
[0060] Step S310: Extract a preset number of pixel blocks from the shoe image that are close to each of the pedestrian key points;
[0061] Step S320: Combine the pixel blocks captured based on each pedestrian key point into a first visual signal.
[0062] In the specific implementation process, the distance between the pedestrian key points and each pixel block is calculated based on their coordinates.
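Steps S310 and S320 can be sketched as a nearest-block selection. The text does not specify the distance measure or how a pixel block's coordinates are defined, so the sketch below assumes Euclidean distance, non-overlapping blocks tiling the image, and the block centre as the block coordinate; the function names are hypothetical.

```python
import math


def block_centers(width, height, size=16):
    """Centres of the non-overlapping size x size blocks tiling the image,
    listed row by row."""
    return [(x + size / 2, y + size / 2)
            for y in range(0, height - size + 1, size)
            for x in range(0, width - size + 1, size)]


def nearest_blocks(keypoint, centers, k):
    """Indices of the k blocks whose centres are closest to the key point."""
    kx, ky = keypoint
    order = sorted(
        range(len(centers)),
        key=lambda i: math.hypot(centers[i][0] - kx, centers[i][1] - ky),
    )
    return order[:k]


# Hypothetical usage: a 64 x 32 image has 8 blocks of 16 x 16 pixels.
centers = block_centers(64, 32)
selected = nearest_blocks((26, 9), centers, k=2)
```

The selected indices identify the pixel blocks that would then be combined into the first visual signaling for that key point.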
[0063] In some embodiments of the present invention, in the step of inputting multiple first visual signals into a preset visual signaling set generator to obtain shoe features corresponding to each shoe image, step S330 involves obtaining a preset feature observation signaling, combining the feature observation signaling with multiple first visual signals to form a visual signaling set, and inputting the visual signaling set into the visual signaling set generator to obtain shoe features corresponding to each shoe image.
[0064] In the specific implementation process, a separate 1*768 feature observation signaling is set up, giving a total of $p^2 + 1$ visual signaling, which form the visual signaling set $S$.
[0065] In the specific implementation process, the visual signaling set is processed by $L$ visual transformation encoders in the visual signaling set generator to obtain the visual signaling coding set $S_L$. The feature observation signaling code in the visual signaling coding set $S_L$ is the shoe feature.
[0066] In some embodiments of the present invention, the visual signaling set generator includes a plurality of visual transformation encoders connected in sequence. In the step of inputting the visual signaling set into the visual signaling set generator to obtain shoe features corresponding to each shoe image, the feature observation signaling and the first visual signaling in the visual signaling set are updated and output after passing through each visual transformation encoder. The updated feature observation signaling output by the last visual transformation encoder in the visual signaling set generator is used as the shoe feature.
[0067] In some embodiments of the present invention, each of the visual transformation encoders includes two layer normalization modules, one multi-head attention module, one multilayer perceptron module, and two addition modules. After the visual signaling set is input into the visual transformation encoder, it first undergoes normalization processing through the first layer normalization module; then, the multi-head attention module extracts attention feature $A$; then, the first addition module adds attention feature $A$ to the original visual signaling set $S$ to obtain feature $F$; then, feature $F$ undergoes normalization processing through the second layer normalization module; then, the multilayer perceptron module processes it to obtain feature $F'$; finally, the second addition module adds feature $F'$ to feature $F$ to obtain the updated visual signaling set $S_\delta$. The updated visual signaling set includes the updated feature observation signaling and the first visual signaling.
[0068] In some embodiments of the present invention, $S_\delta$ represents the output of the δ-th visual transformation encoder, and $S_\delta$ has the same size as $S$.
[0069] In practice, the layer normalization module is a standard normalization processing layer in deep learning.
[0070] In practice, the multi-head attention module is a classic model in deep learning, used to calculate the relationship between each signal in the visual signaling set and other signals.
[0071] In practice, the multilayer perceptron module is a classic neural network model, in which each node is connected to all outputs of the previous layer, and the output of each node is connected to all inputs of the next layer.
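The data flow through one visual transformation encoder, and through a sequence of such encoders, can be sketched as follows. This is a simplified illustration rather than the trained network of the embodiment: biases and the learned scale and shift of layer normalization are omitted, ReLU is assumed as the activation of the multilayer perceptron, the feature observation signaling is assumed to be placed first in the set, and the parameter dictionary keys and function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(x, wq, wk, wv, wo, heads):
    """Scaled dot-product attention per head; head outputs are concatenated."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0]) // heads
    merged = [[] for _ in x]
    for h in range(heads):
        sl = slice(h * d, (h + 1) * d)
        qh, kh, vh = [r[sl] for r in q], [r[sl] for r in k], [r[sl] for r in v]
        scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in kh]
                  for qi in qh]
        head_out = matmul([softmax(r) for r in scores], vh)
        for i, r in enumerate(head_out):
            merged[i].extend(r)
    return matmul(merged, wo)


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def encoder(s, p, heads):
    """S -> A -> F = S + A -> F' -> F + F', following the order in the text."""
    a = multi_head_attention(layer_norm(s), p["wq"], p["wk"], p["wv"], p["wo"], heads)
    f = add(s, a)
    hidden = [[max(0.0, v) for v in row] for row in matmul(layer_norm(f), p["w1"])]
    f_prime = matmul(hidden, p["w2"])
    return add(f, f_prime)


def shoe_feature(first_signals, observation, encoder_params, heads):
    """Run the set through the encoders in sequence and return the updated
    feature observation signaling from the last encoder."""
    s = [observation] + first_signals
    for p in encoder_params:
        s = encoder(s, p, heads)
    return s[0]
```

Because every module preserves the shape of its input, the output of each encoder has the same size as the input set, so any number of encoders can be chained.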
[0072] In some embodiments of the present invention, in the step of calculating the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the image library, the shoe features corresponding to each shoe image in the image library are constructed into a temporary database, and the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the temporary database is calculated.
[0073] In some embodiments of the present invention, in the step of calculating the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the temporary database, the similarity is calculated according to the following formula:
[0074] $$\mathrm{sim}(F_Q, F_i) = \frac{(F_Q)^{T} F_i}{\|F_Q\|\,\|F_i\|}$$
[0075] Here, $F_Q$ represents the matrix of the shoe features corresponding to the shoe image to be matched; $F_i$ represents the matrix of the i-th shoe features in the temporary database; $\mathrm{sim}(F_Q, F_i)$ represents the similarity between the matrix of the shoe features corresponding to the shoe image to be matched and the matrix of the i-th shoe features in the temporary database; $(F_Q)^{T}$ represents the transpose of the matrix of the shoe features corresponding to the shoe image to be matched; and $\|\cdot\|$ represents the norm of a matrix.
[0076] In some embodiments of the present invention, in the step of transforming the shoe image into a shoe image from the same viewpoint based on spatial transformation parameters, the shoe image is transformed into a shoe image from the same viewpoint according to the following formula:
[0077] $$\begin{pmatrix} x_i' \\ y_i' \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix} \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix}$$
[0078] Here, $x_i$ represents the abscissa of any pixel block in the i-th shoe image in the image library before transformation; $x_i'$ represents the abscissa of that pixel block after transformation; $y_i$ represents the ordinate of that pixel block before transformation; $y_i'$ represents the ordinate of that pixel block after transformation; and $\theta_{11}$, $\theta_{12}$, $\theta_{13}$, $\theta_{21}$, $\theta_{22}$, and $\theta_{23}$ are all preset spatial transformation parameters.
[0079] In some embodiments of the present invention, the step of sorting shoe images in the image library based on similarity further includes obtaining a preset number of shoe images in the image library that have a high similarity to the shoe image to be matched.
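The similarity calculation, sorting, and selection of a preset number of results can be sketched as follows. The sketch assumes that the similarity is the cosine similarity suggested by the transpose and norm in the formula description, that each shoe feature is a flat vector, and that the temporary database is a dictionary from image identifier to feature; the function names are hypothetical.

```python
import math


def cosine_similarity(fq, fi):
    """Dot product of the two features divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(fq, fi))
    norm = math.sqrt(sum(a * a for a in fq)) * math.sqrt(sum(b * b for b in fi))
    return dot / norm if norm else 0.0


def rank_library(query_feature, temporary_database, top_k=None):
    """Return (image_id, similarity) pairs sorted from most to least similar,
    optionally truncated to the top_k best matches."""
    scored = [(image_id, cosine_similarity(query_feature, feature))
              for image_id, feature in temporary_database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Hypothetical usage with toy 2-dimensional features.
database = {"a": [0.0, 1.0], "b": [2.0, 0.0], "c": [1.0, 1.0]}
results = rank_library([1.0, 0.0], database, top_k=2)
```

Because the similarity is normalized by both norms, features that point in the same direction score 1 regardless of their magnitude, as with images "b" and the query in the toy example.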
[0080] As shown in Figure 4, a second aspect of the present invention also provides a video pedestrian shoe type retrieval system that integrates local and global information, the system comprising:
[0081] The shoe image transformation module acquires the shoe image to be matched and the shoe images in the preset image library, and transforms the shoe image into a shoe image from the same viewpoint based on spatial transformation parameters;
[0082] The key point acquisition module inputs the shoe image transformed to the same viewpoint into a preset pedestrian key point acquisition device, and the pedestrian key point acquisition device outputs a set of pedestrian key point coordinates for each shoe image.
[0083] The visual signaling set generation module extracts pixel blocks from the shoe image based on the coordinate set of the pedestrian key points corresponding to the shoe image, and constructs a first visual signaling corresponding to each pedestrian key point based on the extracted pixel blocks. Multiple first visual signals are input into a preset visual signaling set generator to obtain shoe features corresponding to each shoe image.
[0084] The similarity calculation module calculates the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the image library, and sorts the shoe images in the image library based on the similarity.
[0085] This solution transforms shoe images into shoe images from the same viewpoint, facilitating comparison between images. Then, key points are collected from each shoe image, pixel blocks are extracted based on these key points, and a first visual signal is constructed. Based on these key points, the most representative region of each shoe image can be collected, improving comparison efficiency. Shoe features are then obtained, which integrate multiple first visual signals. Similarity calculations are performed on these shoe features to directly obtain images that are similar to the shoe image to be matched. This solution improves comparison efficiency while further enhancing the accuracy of similarity calculation through shoe image transformation, key point collection, construction of first visual signals and shoe features, and subsequent similarity calculations.
[0086] This scheme first collects key information from the image using the pedestrian key point collector, then updates the feature observation signaling multiple times using the visual signaling set generator, and finally takes the feature observation signaling that fuses all the first visual signaling as the shoe feature. While ensuring that the collected key information is not distorted, the information is combined, compressed, and extracted to obtain the shoe feature. Similarity is then calculated based on the shoe features, ensuring the reliability of the calculation results.
[0087] This invention also provides a video pedestrian shoe type retrieval device that integrates local and global information. The device includes a computer device, which includes a processor and a memory. The memory stores computer instructions, and the processor executes the computer instructions stored in the memory. When the computer instructions are executed by the processor, the device implements the steps of the method described above.
[0088] This invention also provides a computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the steps of the aforementioned video pedestrian shoe type retrieval method that integrates local and global information. The computer-readable storage medium can be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, register, floppy disk, hard disk, removable storage disk, CD-ROM, or any other form of storage medium known in the art.
[0089] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.
[0090] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.
[0091] In this invention, features described and/or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and/or combined with or in place of features of other embodiments.
[0092] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A video pedestrian shoe type retrieval method that integrates local and global information, characterized in that the method includes the following steps: obtaining a shoe image to be matched and the shoe images in a preset image library, and transforming the shoe images into shoe images from the same viewpoint based on spatial transformation parameters; inputting the shoe images transformed to the same viewpoint into a preset pedestrian key point collector, which outputs a set of pedestrian key point coordinates for each shoe image; extracting pixel blocks from each shoe image based on the set of pedestrian key point coordinates corresponding to that shoe image, and constructing, from the extracted pixel blocks, a first visual signaling corresponding to each pedestrian key point; inputting the multiple first visual signalings into a preset visual signaling set generator to obtain the shoe feature corresponding to each shoe image, wherein a preset feature observation signaling is obtained and combined with the multiple first visual signalings to form a visual signaling set, and the visual signaling set is input into the visual signaling set generator to obtain the shoe feature corresponding to each shoe image; the visual signaling set generator includes multiple visual transformation encoders connected in sequence; each time the feature observation signaling and the first visual signalings in the visual signaling set pass through a visual transformation encoder, an updated feature observation signaling and updated first visual signalings are output; the updated feature observation signaling output by the last visual transformation encoder in the visual signaling set generator is used as the shoe feature; and calculating the similarity between the shoe feature corresponding to the shoe image to be matched and the shoe feature corresponding to each shoe image in the image library, and sorting the shoe images in the image library based on the similarity.
2. The video pedestrian shoe type retrieval method that integrates local and global information according to claim 1, characterized in that the step of extracting pixel blocks from the shoe image based on the set of pedestrian key point coordinates corresponding to the shoe image, and constructing a first visual signaling corresponding to each pedestrian key point based on the extracted pixel blocks, includes: extracting from the shoe image a predetermined number of pixel blocks closest to each pedestrian key point; and combining the pixel blocks extracted for each pedestrian key point into a first visual signaling.
3. The video pedestrian shoe type retrieval method that integrates local and global information according to claim 1, characterized in that each visual transformation encoder includes two layer normalization modules, one multi-head attention module, one multilayer perceptron module, and two addition modules; after the visual signaling set is input into the visual transformation encoder, it is first normalized by the first layer normalization module; attention features are then extracted by the multi-head attention module; the first addition module then adds the attention features to the original visual signaling set; the result is then normalized by the second layer normalization module and processed by the multilayer perceptron module; finally, the second addition module performs addition processing to obtain the updated feature observation signaling and the updated first visual signaling.
4. The video pedestrian shoe type retrieval method that integrates local and global information according to any one of claims 1-3, characterized in that, in the step of calculating the similarity between the shoe feature corresponding to the shoe image to be matched and the shoe feature corresponding to each shoe image in the image library, the shoe features corresponding to the shoe images in the image library are assembled into a temporary database, and the similarity between the shoe feature corresponding to the shoe image to be matched and each shoe feature in the temporary database is calculated.
5. The video pedestrian shoe type retrieval method that integrates local and global information according to claim 4, characterized in that, in the step of calculating the similarity between the shoe feature corresponding to the shoe image to be matched and each shoe feature in the temporary database, the similarity is calculated according to the following formula: $$s_i = \frac{q^{\mathsf{T}} g_i}{\lVert q \rVert \, \lVert g_i \rVert}$$ where $q$ denotes the matrix of the shoe feature corresponding to the shoe image to be matched, $g_i$ denotes the matrix of the $i$-th shoe feature in the temporary database, $s_i$ denotes the similarity between the matrix of the shoe feature corresponding to the shoe image to be matched and the matrix of the $i$-th shoe feature in the temporary database, $q^{\mathsf{T}}$ denotes the transpose of the matrix of the shoe feature corresponding to the shoe image to be matched, and $\lVert \cdot \rVert$ denotes the norm of a matrix.
6. The video pedestrian shoe type retrieval method that integrates local and global information according to claim 1, characterized in that, in the step of transforming the shoe image into a shoe image from the same viewpoint based on spatial transformation parameters, the transformation is performed according to a formula that maps the x-coordinate and y-coordinate of any pixel block in the $i$-th shoe image in the image library before transformation to the x-coordinate and y-coordinate of that pixel block after transformation, all coefficients of the formula being preset spatial transformation parameters.
7. The video pedestrian shoe type retrieval method that integrates local and global information according to claim 1, characterized in that the step of sorting the shoe images in the image library based on the similarity further includes obtaining a preset number of shoe images in the image library that have a high similarity to the shoe image to be matched.
8. A video pedestrian shoe type retrieval system that integrates local and global information, characterized in that the system includes a shoe image transformation module, a key point acquisition module, a visual signaling set generation module, and a similarity calculation module. The shoe image transformation module acquires the shoe image to be matched and the shoe images in the preset image library, and transforms the shoe images into shoe images from the same viewpoint based on spatial transformation parameters. The key point acquisition module inputs the shoe images transformed to the same viewpoint into a preset pedestrian key point acquisition device, and the pedestrian key point acquisition device outputs a set of pedestrian key point coordinates for each shoe image. The visual signaling set generation module extracts pixel blocks from each shoe image based on the set of pedestrian key point coordinates corresponding to that shoe image, and constructs, from the extracted pixel blocks, a first visual signaling corresponding to each pedestrian key point; the multiple first visual signalings are input into a preset visual signaling set generator to obtain the shoe feature corresponding to each shoe image, wherein a preset feature observation signaling is obtained and combined with the multiple first visual signalings to form a visual signaling set, and the visual signaling set is input into the visual signaling set generator to obtain the shoe feature corresponding to each shoe image; the visual signaling set generator includes multiple visual transformation encoders connected in sequence; each time the feature observation signaling and the first visual signalings in the visual signaling set pass through a visual transformation encoder, an updated feature observation signaling and updated first visual signalings are output; the updated feature observation signaling output by the last visual transformation encoder in the visual signaling set generator is used as the shoe feature.
The similarity calculation module calculates the similarity between the shoe features corresponding to the shoe image to be matched and the shoe features corresponding to each shoe image in the image library, and sorts the shoe images in the image library based on the similarity.
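As an illustrative, non-limiting sketch of the similarity calculation and sorting recited in claims 5 and 7 and performed by the similarity calculation module, the following Python example assumes that each shoe feature has been flattened into a one-dimensional vector and that the similarity is the cosine similarity, that is, the inner product of the two features divided by the product of their norms. The function names `cosine_similarity` and `rank_gallery`, the `top_k` parameter, the handling of zero vectors, and the three-dimensional sample features are hypothetical and are not taken from the invention.

```python
import math


def cosine_similarity(query, candidate):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(q * c for q, c in zip(query, candidate))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    if norm_q == 0.0 or norm_c == 0.0:
        # Assumption: a zero feature vector is treated as matching nothing.
        return 0.0
    return dot / (norm_q * norm_c)


def rank_gallery(query, gallery, top_k):
    """Return the top_k (index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, feat)) for i, feat in enumerate(gallery)]
    # Sort the temporary database entries by descending similarity.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical shoe features standing in for the temporary database.
gallery = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [2.0, 2.0, 0.0],
]
# Hypothetical feature of the shoe image to be matched.
query = [1.0, 1.0, 0.0]

top = rank_gallery(query, gallery, top_k=2)
for index, similarity in top:
    print(f"library image {index}: similarity {similarity:.4f}")
```

With these sample values the third library feature (index 2) is parallel to the query feature and is ranked first, followed by the first library feature (index 0). In practice the features would be the updated feature observation signaling produced by the last visual transformation encoder, and `top_k` would correspond to the preset number of shoe images to return.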