A text pedestrian representation learning and matching method and system
By combining ResNet and BERT models with multi-scale representation learning and cross-modal matching, visual and text representations are optimized, which solves the problem of insufficient combination of global and local features in existing technologies and improves the robustness and accuracy of text-image person re-identification.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TONGJI ARTIFICIAL INTELLIGENCE RES INST SUZHOU CO LTD
- Filing Date
- 2022-09-27
- Publication Date
- 2026-06-23
AI Technical Summary
Existing text-image pedestrian re-identification methods cannot effectively combine global and local features, resulting in insufficient robustness of cross-scale matching, and existing models have limited ability to extract semantic features from both visual and textual features.
A pre-trained ResNet model is used to generate primary feature maps, and self-attention is calculated using the Bottleneck Transformer. Word embeddings are learned by combining the BERT model, and visual and text representations are optimized using a cross-modal projection matching function. Features are aligned at local, intermediate, and global scales, and a multi-scale representation learning architecture is designed.
It improves the discriminative power of visual and textual features, enhances the robustness of cross-modal matching, and improves the accuracy and flexibility of retrieval.
Figure CN115527236B_ABST
Description
Technical Field
[0001] This application relates to the field of intelligent analysis technology for surveillance videos, and in particular to a text-person representation learning and matching method and system based on multi-scale self-attention.
Background Technology
[0002] Text-image person re-identification retrieves relevant pedestrians by leveraging existing pedestrian-related queries, such as an image or a text description. Specifically, this technology sorts all pedestrian images in a large image database based on their similarity to the query and selects the most relevant image as the result. Traditional person re-identification typically uses an existing image as the retrieval query, which is not always feasible in practical applications. For example, in criminal investigations, sometimes only a witness description is provided, not an image of the suspect. As a new technology in image retrieval, text-image person re-identification can retrieve target pedestrians from an image database given a text description related to their appearance. When image data is lacking, text descriptions are more natural and readily available as retrieval queries, making text-image person re-identification highly valuable in situations where no target image is available. Furthermore, thanks to open-ended natural language queries, text-based methods are often more flexible. Considering these advantages, this technology has wide applications in daily life, such as security video surveillance and personal photo album search.
[0003] Based on the feature scale used for text-image pedestrian re-identification, relevant algorithm models can be divided into two categories: (1) models based on global features; (2) models based on local features. Models based on global features generate features from the global information of the text and the image and use them directly for retrieval; models based on local features divide the image into small regions and the text into phrases or words, and extract fine-grained discriminative features from these local parts to assist the global features in retrieval.
[0004] On the other hand, a good text-image person re-identification method typically focuses on two aspects: one is enabling images and text to learn representations in a coarse-to-fine manner across all scales, and the other is allowing the model to explore adaptive multi-scale matching methods to align features at different scales. Current work fails to effectively combine these two aspects, neglecting to simultaneously combine global and local scales to generate multi-scale fused features. For multi-scale matching, existing methods attempt to align images and text descriptions at different scales using predefined rules, such as considering only global matching between the entire image and the entire description, or adding matching between text phrases and image regions. However, these methods do not consider cross-scale correlations between different modalities. Therefore, a more reasonable structure is needed that preserves global information within regions while effectively extracting local features, and simultaneously promotes cross-modal multi-scale feature matching.
[0005] Patent CN114612927A proposes a pedestrian re-identification method based on image and text dual-channel joint processing, using the text channel to assist in learning the image channel to complete the pedestrian re-identification task. However, due to the limited semantic extraction capability of visual and text features under the dual-channel framework, the robustness during the retrieval process is insufficient.
[0006] Patent CN113553947A provides a multimodal person re-identification method that combines the advantages of text descriptions and sketch images for person re-identification, and reduces the modal gap between descriptive features and image features based on generative adversarial methods. Although the addition of multimodal information helps improve retrieval performance, the model used for text feature generation in the proposed scheme is relatively simple and cannot obtain deep semantic information.
Summary of the Invention
[0007] In view of this, the purpose of this application is to propose a text pedestrian representation learning and matching method and system that address the problems described above.
[0008] Based on the above objectives, this application proposes a text-based pedestrian representation learning and matching method, including:
[0009] 1) In the visual learning part, a pre-trained ResNet model is used to generate a primary feature map for each input image, and the primary feature map is segmented based on different scales;
[0010] 2) After the feature maps are segmented, a Bottleneck Transformer is used as the representation learning network to perform self-attention calculation on different visual regions;
[0011] 3) In the text representation learning part, each word embedding is learned through a pre-trained BERT model with fixed parameters;
[0012] 4) The word embeddings are further processed through a hybrid branch network that combines residual networks and Transformers, so that the text representations adaptively learn to match the corresponding visual representations.
[0013] 5) Optimize text and image representations through cross-modal projection matching functions, aligning visual and text representations at local, intermediate, and global scales respectively;
[0014] 6) During the testing phase, the combined representation of the three scales is used as the final representation for retrieval.
[0015] Furthermore, in step 1), for local scales, the PCB model strategy is used to horizontally divide the primary feature map into multiple regions; for medium scales, two adjacent horizontal regions are merged into a new region as a medium-scale feature; for global scales, the primary feature map is directly regarded as a global feature.
[0016] Furthermore, in step 2), multiple feature maps within the same scale are fused together using pooling operations to obtain the final representation of that scale, and the representations of all scales are used together as the multi-scale features extracted from the visual part.
[0017] Furthermore, in step 4), for the local and medium-scale branches, the Bottleneck structure in ResNet is used to explore the information connections between adjacent word embeddings through convolution operations, thereby learning a matching representation for the word embedding sequence; for the global scale branch, a shallow Bottleneck Transformer structure is used to extract semantic information with a large span in the text content.
[0018] Furthermore, the representations obtained from the outputs of the local scale, intermediate scale, and global scale branches are collectively used as the multi-scale features extracted from the text portion.
[0019] Furthermore, in step 5), the optimized text and image representations are embedded into a unified space, which not only narrows the distance between modalities but also further enhances the discriminative power of the features.
[0020] Furthermore, in step 6), during the testing phase, the representations of the three scales—local scale, medium scale, and global scale—generated by dual-path computation are combined as the final representation for retrieval and matching between text and images.
[0021] To achieve the above objectives, this application also proposes a text-based pedestrian representation learning and matching system, comprising:
[0022] The visual learning module is used to generate primary feature maps for each input image using a pre-trained ResNet model, and to segment the primary feature maps based on different scales.
[0023] The self-attention computation module is used to perform self-attention computation on different visual regions using the Bottleneck Transformer as a representation learning network after segmentation of the feature map.
[0024] The text representation learning module is used to learn the embedding of each word through a pre-trained BERT model with fixed parameters;
[0025] The visual representation module is used to further process word embeddings through a hybrid branch network that combines residual networks and Transformers, so that the text representations adaptively learn to match the corresponding visual representations.
[0026] The representation optimization module is used to optimize the representations of text and images through a cross-modal projection matching function, aligning visual representations and text representations at local, intermediate, and global scales, respectively.
[0027] The testing module is used to perform retrieval by using the combined representation of the three scales as the final representation.
[0028] In summary, the advantages of this application and the user experience it brings are as follows:
[0029] First, the Transformer structure is introduced in the representation learning part, which helps the model to extract better semantic information from two modalities at the same time, thereby enabling the semantic subjects between the modalities to be connected.
[0030] Second, a multi-scale representation learning architecture was designed, which can maximize the discriminative power of visual and text features from multiple different granularities and enrich the expressive power of representations at different granularities.
[0031] Third, a multi-scale cross-modal matching strategy was designed, which optimizes the representation of text and image by using the cross-modal projection matching function (CMPM). Visual representation and text representation are aligned at local, medium and global scales, respectively, thereby gradually reducing the differences between modalities and further enhancing the discriminativeness of the representation.
Attached Figure Description
[0032] In the accompanying drawings, unless otherwise specified, the same reference numerals throughout the various drawings denote the same or similar parts or elements. These drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments disclosed in this application and should not be construed as limiting the scope of this application.
[0033] Figure 1 is a flowchart of a text pedestrian representation learning and matching method according to an embodiment of this application.
[0034] Figure 2 is a schematic diagram of the feature extraction and matching framework of the method of the present invention.
[0035] Figure 3 shows the Top-k (k=1, 5, 10) accuracy of the algorithm of this invention and other algorithms on the CUHK-PEDES public dataset for text-image pedestrian re-identification.
[0036] Figure 4 is a structural diagram of a text pedestrian representation learning and matching system according to an embodiment of this application.
[0037] Figure 5 is a schematic structural diagram of an electronic device provided in one embodiment of this application.
[0038] Figure 6 is a schematic diagram of a storage medium provided in one embodiment of this application.
Detailed Implementation
[0039] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the invention. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.
[0040] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0041] Figure 1 shows a flowchart of a text pedestrian representation learning and matching method according to an embodiment of this application, which includes the following steps:
[0042] Step 1: Generate feature maps at different scales for the image. Specifically, image I is first fed into ResNet, and the output of its fourth residual block, $f_I \in \mathbb{R}^{H \times W \times C}$, is taken as the primary feature map of the image, where $\mathbb{R}$ denotes the real number domain and H, W, and C represent the height, width, and number of channels of the feature map, respectively. The multi-scale visual representation learning module is divided into three branches: a local-scale branch, a medium-scale branch, and a global-scale branch. For the first two branches, the PCB model strategy is first used to horizontally segment the primary feature map $f_I$ into K local region feature maps $\{f_{I,1}, f_{I,2}, \ldots, f_{I,K}\}$. For the medium-scale branch, every two adjacent local region feature maps are stitched together vertically to generate a set of medium-scale feature maps, while in the global branch the model directly uses the primary feature map $f_I$ as the global feature map.
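The partitioning in Step 1 can be illustrated with a minimal sketch. It assumes the feature map is held as a plain list of H rows and that K divides H evenly; the function name `split_scales` and the toy data are hypothetical and not taken from the application.

```python
def split_scales(feature_map, k):
    """Split an H-row feature map into local, medium and global regions.

    feature_map: list of H rows (each row may hold any per-row data).
    k: number of horizontal stripes; assumed here to divide H evenly.
    """
    h = len(feature_map)
    if k < 1 or h % k != 0:
        raise ValueError("this sketch assumes H is a positive multiple of K")
    stripe = h // k
    # Local scale: K horizontal stripes (PCB-style partition).
    local = [feature_map[i * stripe:(i + 1) * stripe] for i in range(k)]
    # Medium scale: every two adjacent stripes stitched vertically -> K-1 regions.
    medium = [local[i] + local[i + 1] for i in range(k - 1)]
    # Global scale: the primary feature map itself.
    return local, medium, feature_map


rows = [[float(r)] * 2 for r in range(6)]  # toy map with H=6, W=2, one channel
local, medium, global_map = split_scales(rows, k=3)
print(len(local), len(medium), len(global_map))  # 3 2 6
```

With K stripes, the medium scale always yields K-1 overlapping regions, which is why Step 2 reports K local features and K-1 medium-scale features.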
[0043] Step 2: Use a representation learning network to perform self-attention calculations on different visual regions. Specifically, in the local-scale branch, each local feature map is directly used as input to the Bottleneck Transformer to obtain self-attention-weighted feature maps within the region. These are then max-pooled to obtain K local features, where the superscript $l$ denotes a local feature. In the medium-scale branch, the set of medium-scale feature maps is used as input to the Bottleneck Transformer to obtain self-attention-weighted feature maps within the stitched regions. After max pooling, K-1 medium-scale features are obtained, where the superscript $m$ denotes a medium-scale feature. In the global branch, the primary feature map $f_I$ is used directly as input to the Bottleneck Transformer to obtain the global feature, where the superscript $g$ denotes a global feature. The representations extracted at these three scales are then combined to form the visual representation for the multi-scale matching stage.
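The pooling at the end of each visual branch can be sketched as follows. This is only an illustration of the max-pooling step: the Bottleneck Transformer itself is omitted, and the regions are assumed to be already attention-weighted h×w×C nested lists. The names `max_pool_region` and `pool_branch` are hypothetical.

```python
def max_pool_region(region):
    """Global max pooling over an h x w x C region, giving a C-dimensional vector."""
    channels = len(region[0][0])
    return [max(pixel[c] for row in region for pixel in row)
            for c in range(channels)]


def pool_branch(regions):
    """Pool every region of one branch: K vectors (local) or K-1 vectors (medium)."""
    return [max_pool_region(region) for region in regions]


# Toy region with h=2, w=2, C=2.
region = [[[1.0, 5.0], [2.0, 0.0]],
          [[3.0, -1.0], [0.5, 4.0]]]
print(max_pool_region(region))  # [3.0, 5.0]
```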
[0044] Step 3: Learn each word embedding for the text description. Specifically, the text description is decomposed into a word sequence according to specific rules and then processed by a pre-trained tokenizer to obtain a token sequence. [CLS] and [SEP] tokens are inserted at the beginning and end, respectively, as the classification and separator tokens. The token sequence is then truncated or padded to a set maximum length L so that the token sequence generated for each text description has the same length. Finally, it is fed into a BERT model with fixed parameters to generate the word embeddings $f_w \in \mathbb{R}^{L \times D}$, where D represents the dimension of the output word embedding. The word embeddings are then fed into the representation learning module, which also consists of three branches: a local-scale branch, a medium-scale branch, and a global-scale branch. Each branch adapts the text representation to the visual representation at the corresponding scale. Before the word embedding features are input into each branch, a dimension is added so that the features match the input shape of the branch network. Furthermore, since the dimension D of the word embeddings needs to match the number of channels C of the visual part, the features are passed through a 1×1 convolutional layer to adjust the number of channels, and the text features $f_t \in \mathbb{R}^{1 \times L \times C}$ are then obtained through a batch normalization layer.
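The length normalization in Step 3 can be sketched as below. This is a simplified illustration, not the tokenizer used in the application: whitespace splitting stands in for a real BERT tokenizer, the `[PAD]` token and the function name `prepare_tokens` are assumptions, and truncation is applied before the special tokens are added so that `[SEP]` is never cut off.

```python
def prepare_tokens(text, max_len, pad_token="[PAD]"):
    """Return exactly max_len tokens framed by [CLS] and [SEP]."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    # Assumption: lowercase whitespace splitting replaces a real tokenizer.
    words = text.lower().split()
    # Truncate the words so the two special tokens still fit.
    words = words[:max_len - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    # Pad short sequences up to the fixed length L.
    tokens += [pad_token] * (max_len - len(tokens))
    return tokens


print(prepare_tokens("A man wearing a red jacket", 6))
# ['[CLS]', 'a', 'man', 'wearing', 'a', '[SEP]']
print(prepare_tokens("red jacket", 6))
# ['[CLS]', 'red', 'jacket', '[SEP]', '[PAD]', '[PAD]']
```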
[0045] Step 4: Obtain text representations using a hybrid branch network. Specifically, for the first two branches, residual Bottleneck structures similar to those used in traditional ResNet models are stacked to form the representation learning network. Each Bottleneck structure consists of two 1×1 convolutional layers with a 1×3 convolutional layer sandwiched between them. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function, and skip connections are added to suppress network degradation and accelerate model convergence. The local-scale and medium-scale branches each consist of three stacked Bottleneck blocks. Max pooling is added at the end of each branch, generating the local text representation $f_t^l \in \mathbb{R}^C$ and the medium text representation $f_t^m \in \mathbb{R}^C$ for these two branches, respectively. For the global-scale branch, the Bottleneck Transformer is applied to the text domain in a manner similar to its use in visual recognition, with only a single Transformer block used to learn global text information. Max pooling is also added at the end of this branch to obtain the global text representation $f_t^g \in \mathbb{R}^C$.
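To show how a 1×3 convolution links adjacent word embeddings before max pooling, the following heavily simplified sketch may help. It is not the Bottleneck structure itself: the two 1×1 convolutions and batch normalization are omitted, a single 3-tap kernel is shared across channels, and all function names are hypothetical.

```python
def conv1x3(seq, kernel):
    """1x3 convolution along the word axis with zero padding.

    seq: list of L word vectors, each with C values.
    kernel: three weights applied to the previous, current and next word.
    """
    length, channels = len(seq), len(seq[0])
    zero = [0.0] * channels
    padded = [zero] + seq + [zero]
    return [[sum(kernel[t] * padded[i + t][c] for t in range(3))
             for c in range(channels)]
            for i in range(length)]


def residual_block(seq, kernel):
    """ReLU(conv(x)) + x: the skip connection keeps the original embedding."""
    conv = conv1x3(seq, kernel)
    return [[max(0.0, y) + x for x, y in zip(row_in, row_out)]
            for row_in, row_out in zip(seq, conv)]


def max_pool_sequence(seq):
    """Max over the L word positions, giving one C-dimensional vector."""
    return [max(column) for column in zip(*seq)]


words = [[1.0], [2.0], [3.0]]  # L=3 words, C=1 channel
out = residual_block(words, kernel=[1.0, 1.0, 1.0])
print(out)                     # [[4.0], [8.0], [8.0]]
print(max_pool_sequence(out))  # [8.0]
```

Each output position mixes a word with its two neighbours, which is the sense in which the convolution explores connections between adjacent word embeddings.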
[0046] Step 5: Within a given mini-batch of size N, the set of visual representations is defined as $\{x_i\}_{i=1}^{N}$, where $x_i$ is the representation of the i-th image and carries the identity label corresponding to the i-th image. The set of text representations is defined as $\{z_j\}_{j=1}^{N}$, where $z_j$ is the representation of the j-th text and carries the identity label corresponding to the j-th text. For a visual representation $x_i$, the set of text representations within the batch forms with it the image-text representation pairs $\{(x_i, z_j), y_{i,j}\}_{j=1}^{N}$, where $y_{i,j}$ indicates whether the image-text pair matches on identity label. That is, when the text and the image have the same identity label, $y_{i,j}=1$; otherwise $y_{i,j}=0$. For each image-text representation pair, the matching probability is obtained using the following formula:
[0047] $$p_{i,j} = \frac{\exp\left(x_i^{\top} \bar{z}_j\right)}{\sum_{k=1}^{N} \exp\left(x_i^{\top} \bar{z}_k\right)}$$
[0048] where $\bar{z}_j = z_j / \lVert z_j \rVert$ represents the text representation after normalization. The resulting $p_{i,j}$ can be seen as the proportion that the similarity between $x_i$ and $\bar{z}_j$ accounts for in the sum of the similarities between $x_i$ and all text representations $\bar{z}_k$ in the entire batch. For the true matching probability between $x_i$ and $z_j$, since several samples within the same batch may match simultaneously, the normalized label distribution $q_{i,j}$ is used as the true distribution, calculated as follows:
[0049] $$q_{i,j} = \frac{y_{i,j}}{\sum_{k=1}^{N} y_{i,k}}$$
[0050] The matching loss computed between $x_i$ and the text representations can be expressed as:
[0051] $$L_i = \sum_{j=1}^{N} p_{i,j} \log \frac{p_{i,j}}{q_{i,j} + \varepsilon}$$
[0052] Here, $\varepsilon$ is a very small value used to avoid numerical problems. This expression can be viewed as the KL divergence between distribution $p_i$ and distribution $q_i$. Minimizing this KL divergence brings the predicted distribution $p_i$ closer to the true distribution $q_i$, which makes the probability predictions more accurate. The loss is calculated in turn for each visual representation within the mini-batch and then summed to obtain the one-way matching loss from image to text:
[0053] $$L_{i2t} = \sum_{i=1}^{N} L_i$$
[0054] In the text part, a similar approach is used to calculate the similarity between each text representation and all visual representations within the mini-batch, and the one-way matching loss from text to image, $L_{t2i}$, is obtained according to the loss calculation method described above. The matching loss L is obtained by summing the matching losses in the two directions:
[0055] $$L = L_{i2t} + L_{t2i}$$
[0056] The final loss function $L_{multi}$ is obtained by summing the feature matching losses at the different scales, as expressed in the following formula:
[0057] $$L_{multi} = L_l + L_m + L_g$$
[0058] where $L_l$, $L_m$, and $L_g$ represent the CMPM loss calculated at the local, medium, and global scales, respectively.
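The loss described in Step 5 can be sketched in plain Python as follows. This is an illustrative reading of the text, not the application's implementation: the softmax over projections onto normalized targets follows the commonly published CMPM formulation, the per-query losses are summed as the text states, and every function name, the `eps` default, and the dictionary keys `"local"`, `"medium"`, `"global"` are assumptions. Representations are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def one_way_cmpm(queries, targets, query_ids, target_ids, eps=1e-8):
    """Sum over queries of KL(p_i || q_i), one matching direction."""
    targets_n = [l2_normalize(t) for t in targets]
    total = 0.0
    for q_vec, q_id in zip(queries, query_ids):
        # Projection of the query onto each normalized target.
        logits = [sum(a * b for a, b in zip(q_vec, t)) for t in targets_n]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in logits]
        denom = sum(exps)
        p = [e / denom for e in exps]
        # Normalized label distribution: matching samples share the mass.
        y = [1.0 if q_id == t_id else 0.0 for t_id in target_ids]
        n_match = sum(y)
        q = [v / n_match if n_match else 0.0 for v in y]
        total += sum(pj * math.log(pj / (qj + eps))
                     for pj, qj in zip(p, q) if pj > 0.0)
    return total


def cmpm_loss(images, texts, image_ids, text_ids):
    """Bidirectional loss L = L_i2t + L_t2i at a single scale."""
    return (one_way_cmpm(images, texts, image_ids, text_ids)
            + one_way_cmpm(texts, images, text_ids, image_ids))


def multi_scale_loss(image_reps, text_reps, image_ids, text_ids):
    """L_multi = L_l + L_m + L_g over the three scales."""
    return sum(cmpm_loss(image_reps[s], text_reps[s], image_ids, text_ids)
               for s in ("local", "medium", "global"))


# Toy batch: two identities, one image and one text each.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[2.0, 0.0], [0.0, 3.0]]
ids = [0, 1]
reps_i = {s: images for s in ("local", "medium", "global")}
reps_t = {s: texts for s in ("local", "medium", "global")}
print(multi_scale_loss(reps_i, reps_t, ids, ids))
```

Because the targets are normalized while the queries are not, increasing the magnitude of a correctly aligned query sharpens the predicted distribution and lowers the loss.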
[0059] Step 6: In the testing phase, the combined representation of the three scales is used as the final representation for retrieval. Specifically, for the visual representation, the visual representations of the above three scales are combined by direct summation, and the resulting representation is used to calculate the similarity with the text representation; the results are then sorted by similarity to obtain the search results. For the text representation, similar to the visual part, the representations at different scales are directly summed to obtain a text representation that integrates multi-scale information, which is used to calculate the similarity with the visual representation and to sort the results into the search results.
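The test-time procedure in Step 6 can be sketched as below. The sketch assumes each scale has already been reduced to a single C-dimensional vector, and it uses cosine similarity as the ranking score; the text does not name the similarity measure, so that choice and the function names are assumptions.

```python
import math


def fuse(local, medium, global_):
    """Combine the three scale representations by direct summation."""
    return [l + m + g for l, m, g in zip(local, medium, global_)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar."""
    scores = [(cosine(query, item), idx) for idx, item in enumerate(gallery)]
    return [idx for _, idx in sorted(scores, key=lambda s: s[0], reverse=True)]


# Fused text query and three fused image representations (toy values).
text_query = fuse([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
gallery = [[0.0, 2.0], [3.0, 0.5], [1.0, 1.0]]
print(rank_gallery(text_query, gallery))  # [1, 2, 0]
```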
[0060] The specific implementation steps of this invention are as follows:
[0061] Figure 1 is a flowchart of the algorithm implementation of the present invention, and the specific implementation method is as follows:
[0062] 1. Calculate the primary feature map $f_I \in \mathbb{R}^{H \times W \times C}$ of image I;
[0063] 2. The primary feature map is segmented and combined according to different scales to obtain feature maps of different scales;
[0064] 3. Use the above feature maps as inputs to the corresponding branches of the Transformer to obtain visual representations at different scales.
[0065] 4. Use a pre-trained BERT model to learn the word embedding sequence $f_w \in \mathbb{R}^{L \times D}$ of the text description T;
[0066] 5. Process the word embeddings to obtain the text features $f_t \in \mathbb{R}^{1 \times L \times C}$;
[0067] 6. Input text features into a hybrid branch network to obtain text representations at different scales.
[0068] 7. Use the visual representation $F_I$ and the text representation $F_T$ obtained in steps 3 and 6 as input to the multi-scale feature matching module, and calculate the matching loss L at each scale;
[0069] 8. Sum the matching losses at the different scales to obtain the final loss $L_{multi}$, then back-propagate it and update the model;
[0070] 9. After training is completed, the combined representation of the three scales is used as the final representation for retrieval in both modalities.
[0071] This application provides a text pedestrian representation learning and matching system, which is used to execute the text pedestrian representation learning and matching method described in the above embodiments. As shown in Figure 4, the system includes:
[0072] The visual learning module 501 is used to generate a primary feature map for each input image using a pre-trained ResNet model, and to segment the primary feature map based on different scales.
[0073] Self-attention calculation module 502 is used to perform self-attention calculation on different visual regions using the Bottleneck Transformer as a representation learning network after segmentation of feature maps.
[0074] The text representation learning module 503 is used to learn the embedding of each word through a pre-trained BERT model with fixed parameters;
[0075] The visual representation module 504 is used to further process word embeddings through a hybrid branch network that combines residual networks and Transformers, so that the text representations adaptively learn to match the corresponding visual representations.
[0076] The representation optimization module 505 is used to optimize the representations of text and images through a cross-modal projection matching function, aligning visual representations and text representations at local, intermediate, and global scales, respectively.
[0077] Test module 506 is used to perform retrieval by using the combined representation of the three scales as the final representation.
[0078] The text pedestrian representation learning and matching system provided in the above embodiments of this application and the text pedestrian representation learning and matching method provided in the embodiments of this application are based on the same inventive concept and have the same beneficial effects as the methods adopted, run or implemented by the applications stored therein.
[0079] This application also provides an electronic device corresponding to the text pedestrian representation learning and matching method provided in the foregoing embodiments, for executing the above text pedestrian representation learning and matching method. This application does not limit the scope of the embodiments.
[0080] Please refer to Figure 5, which shows a schematic diagram of an electronic device provided by some embodiments of this application. As shown in Figure 5, the electronic device 20 includes: a processor 200, a memory 201, a bus 202, and a communication interface 203. The processor 200, the communication interface 203, and the memory 201 are connected via the bus 202. The memory 201 stores a computer program that can run on the processor 200. When the processor 200 runs the computer program, it executes the text pedestrian representation learning and matching method provided in any of the foregoing embodiments of this application.
[0081] The memory 201 may include high-speed random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Communication between this system network element and at least one other network element is achieved through at least one communication interface 203 (which can be wired or wireless), such as the Internet, wide area network, local area network, or metropolitan area network.
[0082] Bus 202 can be an ISA bus, PCI bus, or EISA bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. The memory 201 is used to store programs. After receiving an execution instruction, the processor 200 executes the program. The text pedestrian representation learning and matching method disclosed in any of the foregoing embodiments of this application can be applied to the processor 200, or implemented by the processor 200.
[0083] The processor 200 may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 200 or by instructions in software form. The processor 200 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well established in the art. The storage medium is located in memory 201. The processor 200 reads the information in memory 201 and, in conjunction with its hardware, completes the steps of the above method.
[0084] The electronic device provided in this application embodiment and the text pedestrian representation learning and matching method provided in this application embodiment are based on the same inventive concept and have the same beneficial effects as the methods they adopt, operate or implement.
[0085] This application also provides a computer-readable storage medium corresponding to the text pedestrian representation learning and matching method provided in the foregoing embodiments. Please refer to Figure 6. The computer-readable storage medium shown is an optical disc 30, on which a computer program (i.e., a program product) is stored. When the computer program is run by a processor, it executes the text pedestrian representation learning and matching method provided in any of the foregoing embodiments.
[0086] It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other optical and magnetic storage media, which will not be elaborated here.
[0087] The computer-readable storage medium provided in the above embodiments of this application and the text pedestrian representation learning and matching method provided in the embodiments of this application are based on the same inventive concept and have the same beneficial effects as the methods adopted, run or implemented by the applications stored therein.
[0088] It should be noted that:
[0089] The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used in conjunction with the teachings herein. The required structure for constructing such systems is apparent from the above description. Furthermore, this application is not directed to any particular programming language. It should be understood that the content of this application described herein can be implemented using various programming languages, and the above description of specific languages is for the purpose of disclosing the best mode of implementation of this application.
[0090] Numerous specific details are set forth in the specification provided herein. However, it will be understood that embodiments of this application may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
[0091] Similarly, it should be understood that, in order to simplify this application and aid in understanding one or more of the various inventive aspects, in the above description of exemplary embodiments of this application, various features of this application are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of disclosure should not be construed as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as reflected in the following claims, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into that detailed description, wherein each claim itself is a separate embodiment of this application.
[0092] Those skilled in the art will understand that modules in the device of the embodiments can be adaptively changed and placed in one or more devices different from that embodiment. Modules, units, or components in the embodiments can be combined into a single module, unit, or component, and further, they can be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, any combination can be used to combine all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature that serves the same, equivalent, or similar purpose.
[0093] Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features but not others included in other embodiments, combinations of features from different embodiments are intended to be within the scope of this application and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
[0094] The various component embodiments of this application can be implemented in hardware, or as software modules running on one or more processors, or a combination thereof. Those skilled in the art will understand that microprocessors or digital signal processors (DSPs) can be used in practice to implement some or all of the functions of some or all of the components in the text pedestrian representation learning and matching system according to the embodiments of this application. This application can also be implemented as a device or system program (e.g., a computer program and computer program product) for performing part or all of the methods described herein. Such an implementation of this application can be stored on a computer-readable medium, or can be in the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
[0095] It should be noted that the above embodiments are illustrative of this application and not restrictive, and that those skilled in the art can devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses should not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in the claims. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. This application can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In the unit claims enumerating several systems, several of these systems may be embodied by the same item of hardware. The use of the words first, second, and third, etc., does not indicate any order. These words can be interpreted as names.
[0096] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various variations or substitutions within the technical scope disclosed in this application, and these should all be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A text pedestrian representation learning and matching method, characterized in that it comprises the following steps:
1) In the visual representation learning part, a pre-trained ResNet model is used to generate a primary feature map for each input image, and the primary feature map is segmented at different scales;
2) After segmentation, a Bottleneck Transformer is used as the representation learning network to perform self-attention calculation on the different visual regions;
3) In the text representation learning part, the embedding of each word is learned through a pre-trained BERT model with fixed parameters;
4) The word embeddings are further processed through a hybrid branch network that combines residual networks and Transformers, so that the text representations adaptively learn to match the corresponding visual representations;
5) The text and image representations are optimized through a cross-modal projection matching function, aligning the visual and text representations at the local, intermediate, and global scales respectively;
6) During the testing phase, the combined representation of the three scales is used as the final representation for retrieval;
wherein, in step 1), at the local scale, the primary feature map is horizontally divided into multiple regions using the PCB model strategy; at the intermediate scale, two adjacent horizontal regions are merged into a new region as the intermediate-scale feature; at the global scale, the primary feature map is directly used as the global feature;
in step 2), multiple feature maps within the same scale are fused through a pooling operation to form the final representation of that scale, and the representations of all scales together serve as the multi-scale features extracted from the visual part;
in step 4), for the local-scale and intermediate-scale branches, the Bottleneck structure in ResNet is used to explore the information connections between adjacent word embeddings through convolution operations, thereby learning a matching representation for the word embedding sequence; for the global-scale branch, a shallow Bottleneck Transformer structure is used to extract semantic information with a large span in the text content.
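The three-scale partition recited in steps 1) and 2) can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the function names, the choice of six stripes, the stride-1 (overlapping) merging of adjacent stripes, and the use of average pooling for both region pooling and fusion are all assumptions made for illustration, and the Bottleneck Transformer self-attention applied to each region is omitted.

```python
from typing import Dict, List

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col][channel]


def split_horizontal(fmap: FeatureMap, num_regions: int) -> List[FeatureMap]:
    """Split a feature map into equal-height horizontal stripes (PCB-style)."""
    height = len(fmap)
    if height % num_regions != 0:
        raise ValueError("height must be divisible by num_regions")
    step = height // num_regions
    return [fmap[i:i + step] for i in range(0, height, step)]


def merge_adjacent(regions: List[FeatureMap]) -> List[FeatureMap]:
    """Merge each pair of neighbouring stripes into one intermediate region.

    Assumption: a sliding window with stride 1, so merged regions overlap.
    """
    return [regions[i] + regions[i + 1] for i in range(len(regions) - 1)]


def average_pool(region: FeatureMap) -> Vector:
    """Average all spatial cells of a region into one channel vector."""
    cells = [cell for row in region for cell in row]
    channels = len(cells[0])
    return [sum(c[ch] for c in cells) / len(cells) for ch in range(channels)]


def fuse(vectors: List[Vector]) -> Vector:
    """Fuse the region vectors of one scale by element-wise averaging (assumed)."""
    return [sum(vals) / len(vals) for vals in zip(*vectors)]


def multi_scale_features(fmap: FeatureMap, num_regions: int = 6) -> Dict[str, Vector]:
    """Return one pooled representation per scale: local, intermediate, global."""
    local = split_horizontal(fmap, num_regions)
    intermediate = merge_adjacent(local)
    return {
        "local": fuse([average_pool(r) for r in local]),
        "intermediate": fuse([average_pool(r) for r in intermediate]),
        "global": average_pool(fmap),
    }
```

With six local stripes, this sketch yields five overlapping intermediate regions; a non-overlapping pairing would yield three, and the claim wording does not say which is intended.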
2. The text pedestrian representation learning and matching method according to claim 1, characterized in that the representations obtained from the outputs of the local-scale, intermediate-scale, and global-scale branches are collectively used as the multi-scale features extracted from the text.
3. The text pedestrian representation learning and matching method according to claim 1, characterized in that, in step 5), the optimized text and image representations are embedded into a unified space, which not only narrows the distance between modalities but also further enhances the discriminative power of the features.
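The claims do not spell out the cross-modal projection matching function used in step 5). The sketch below assumes it follows the commonly used cross-modal projection matching (CMPM) formulation: each feature of one modality is projected onto the normalized features of the other modality, the projections are turned into a softmax distribution, and the KL divergence from the normalized identity-match distribution is minimized in both directions. All function names and the `eps` smoothing constant are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def projection_matching_loss(anchors, targets, anchor_ids, target_ids, eps=1e-8):
    """One direction: KL(p || q) averaged over anchors.

    p is the softmax over scalar projections of an anchor onto the
    normalized targets; q is the normalized identity-match distribution.
    """
    normed = [_normalize(t) for t in targets]
    loss = 0.0
    for anchor, a_id in zip(anchors, anchor_ids):
        p = _softmax([_dot(anchor, t) for t in normed])
        matches = [1.0 if a_id == t_id else 0.0 for t_id in target_ids]
        total = sum(matches)
        q = [m / total for m in matches] if total else matches
        loss += sum(pi * math.log(pi / (qi + eps))
                    for pi, qi in zip(p, q) if pi > 0.0)
    return loss / len(anchors)


def cmpm_loss(image_feats, text_feats, ids, eps=1e-8):
    """Bidirectional loss for a batch where image i and text i share ids[i]."""
    return (projection_matching_loss(image_feats, text_feats, ids, ids, eps)
            + projection_matching_loss(text_feats, image_feats, ids, ids, eps))
```

Under this assumption, aligning at three scales would amount to evaluating `cmpm_loss` once per scale on the local, intermediate, and global representation pairs and summing the results.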
4. The text pedestrian representation learning and matching method according to claim 1, characterized in that, in step 6), during the testing phase, the representations of the local scale, intermediate scale, and global scale generated by the dual-path calculation are combined as the final representation for retrieval and matching between text and images.
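The test-phase retrieval of step 6) can be sketched as follows. The claims say only that the three scale representations are "combined"; this sketch assumes the combination is concatenation and that gallery images are ranked by cosine similarity to the text query. Both choices, along with every name below, are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

SCALES = ("local", "intermediate", "global")
Representation = Dict[str, List[float]]


def combine_scales(rep: Representation) -> List[float]:
    """Concatenate the three scale vectors into one retrieval vector (assumed)."""
    return [x for scale in SCALES for x in rep[scale]]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_gallery(text_rep: Representation,
                 gallery: Dict[str, Representation]) -> List[Tuple[str, float]]:
    """Rank gallery images by similarity to the text query, best match first."""
    query = combine_scales(text_rep)
    scored = [(image_id, cosine_similarity(query, combine_scales(rep)))
              for image_id, rep in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

The first entry of the returned list is the gallery image whose combined representation is closest to the text description.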
5. A text pedestrian representation learning and matching system using the method according to any one of claims 1-4, characterized in that it comprises:
a visual learning module, used to generate a primary feature map for each input image using a pre-trained ResNet model, and to segment the primary feature map at different scales;
a self-attention computation module, used to perform self-attention computation on different visual regions using the Bottleneck Transformer as a representation learning network after segmentation of the feature map;
a text representation learning module, used to learn the embedding of each word through a pre-trained BERT model with fixed parameters;
a visual representation module, used to further process the word embeddings through a hybrid branch network that combines residual networks and Transformers, so that the text representations adaptively learn to match the corresponding visual representations;
a representation optimization module, used to optimize the representations of text and images through a cross-modal projection matching function, aligning the visual representations and text representations at the local, intermediate, and global scales respectively;
a testing module, used to perform retrieval by using the combined representation of the three scales as the final representation.
6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor executes the computer program to implement the method according to any one of claims 1-4.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that the program is executed by a processor to implement the method according to any one of claims 1-4.