A text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment

By combining bidirectional optimal transport with multimodal feature alignment, the method addresses the modality semantic gap and the coarse feature-alignment granularity in text-based person retrieval. It achieves fine-grained cross-modal semantic alignment and learns strongly discriminative features, improving retrieval accuracy and generalization ability.

CN122240859A · Pending · Publication Date: 2026-06-19 · ELECTRIC POWER RES INST OF STATE GRID ZHEJIANG ELECTRIC POWER COMPANY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ELECTRIC POWER RES INST OF STATE GRID ZHEJIANG ELECTRIC POWER COMPANY
Filing Date
2026-05-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing text-based person retrieval methods suffer from problems such as a large modal semantic gap, coarse feature alignment granularity, and weak cross-domain generalization ability, making it difficult to effectively establish a fine semantic correspondence between text descriptions and image regions.

Method used

The method employs bidirectional optimal transport and multimodal feature alignment. By combining global feature alignment and optimal transport of local features with adaptive distribution weights and consistency constraints, it achieves fine-grained cross-modal semantic alignment and highly discriminative feature learning.

Benefits of technology

It significantly improves the accuracy and generalization performance of text-based person retrieval, enabling accurate retrieval of person images in complex scenarios.

✦ Generated by Eureka AI based on patent content.

Figure CN122240859A_ABST

Abstract

This invention discloses a text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment. The method includes: taking text descriptions and visible-light images as model input and preprocessing each separately; extracting global features and local token features of the text with a pre-trained CLIP text encoder, and global features and local patch features of the image with a pre-trained CLIP image encoder; performing global feature alignment, in which an identity classification loss constrains the model to learn highly discriminative identity features; achieving fine-grained cross-modal semantic alignment through optimal transport of local features in both the text-to-image and image-to-text directions, with adaptive distribution weights; and jointly optimizing the entire model with a bidirectional consistency constraint, the identity loss, and a bidirectional optimal-transport contrastive loss. This invention significantly improves the accuracy and generalization performance of text-based person retrieval.

Description

Technical Field

[0001] This invention belongs to the field of computer vision and deep learning, and in particular relates to a text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment.

Background Technology

[0002] Text-based person retrieval aims to retrieve individuals with a specific identity from an image database based on a text description, and has significant application value in fields such as security monitoring and intelligent retrieval. However, existing methods face the following technical challenges:

[0003] 1. Modality gap: text and images belong to different modalities and their feature distributions differ significantly, so direct feature matching has limited effectiveness;
2. Single alignment granularity: most methods perform only global feature alignment and ignore the semantic correspondence of local, fine-grained features;
3. Insufficient feature discrimination: when people look similar and the background is complex, the model has difficulty learning highly discriminative identity features;
4. Limited generalization ability: model performance drops significantly when deployed across different scenarios and cameras.

[0004] Traditional methods typically employ simple feature concatenation or attention mechanisms for modality fusion, but these methods struggle to establish fine-grained semantic correspondences between text descriptions and image regions. Optimal transport theory provides a mathematical foundation for cross-modal alignment, but existing methods based on optimal transport often use unidirectional transport (e.g., only from text to image) and lack symmetry and bidirectional consistency constraints. This can lead to unstable alignment results and fails to fully exploit the complementary information between modalities.

Summary of the Invention

[0005] The technical problem to be solved by this invention is to overcome the insufficient modality alignment, weak feature discrimination, and poor generalization ability of existing text-based person retrieval methods. It provides a text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment. Through bidirectional optimal transport of local features and multi-level loss constraints, it achieves fine-grained cross-modal semantic alignment and strongly discriminative feature learning, which significantly improves the accuracy and generalization performance of text-based person retrieval.

[0006] Therefore, the present invention adopts the following technical solution: a text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment, comprising the following steps:
1) Use text descriptions and visible-light images as input to the model, and preprocess the text and the visible-light images separately;
2) Use the pre-trained CLIP text encoder to extract global features and local token features of the text, and use the pre-trained CLIP image encoder to extract global features and local patch features of the image;
3) Global feature alignment: the model is constrained by the identity loss function to learn highly discriminative identity features;
4) Fine-grained cross-modal semantic alignment is achieved through optimal transport of local features in both the text-to-image and image-to-text directions, with adaptive distribution weights;
5) Joint optimization and loss calculation: the entire model is jointly optimized through the bidirectional consistency constraint, the identity loss, and the bidirectional optimal-transport contrastive loss;
6) In the model inference stage, features of the query text and the gallery images are extracted and their similarity is calculated; the best-matching persons are then output by ranking to complete the retrieval task.

[0007] To address the large modality semantic gap, coarse feature-alignment granularity, and weak cross-domain generalization of existing text-based person retrieval methods, this invention proposes a multimodal feature alignment scheme based on bidirectional optimal transport. Fine-grained cross-modal semantic correspondence is established through bidirectional optimal transport of local features between text and image, which bridges the modality gap. A multi-level framework combining global identity feature alignment with local bidirectional optimal transport alignment provides feature matching from the global to the local level. Adaptive distribution weight calculation and a consistency constraint enhance feature discriminability while improving the model's generalization to unseen scenarios, significantly improving the accuracy and robustness of person retrieval.

[0008] Furthermore, step 1) specifically includes:
Step 11) The input for each training sample is $(T, I, y)$, where $T$ is the text description, $I \in \mathbb{R}^{H \times W \times C}$ is the visible-light image ($H$, $W$, and $C$ are the image height, width, and number of channels, respectively), and $y$ is the person identity label;
Step 12) Text preprocessing includes word segmentation, padding, and encoding to convert the text into a token sequence; visible-light image preprocessing includes resizing, normalization that scales pixel values from [0, 255] to [-1, 1], and standardization using the ImageNet mean and standard deviation.

[0009] Furthermore, the global and local token features of the text are extracted with the pre-trained CLIP text encoder as follows: a start token [SOS] and an end token [EOS] are added to the text token sequence of length $L$, and the resulting sequence is input into the CLIP text encoder to extract the global and local token features, where $f^{t}_{g} \in \mathbb{R}^{d}$ is the global text feature, $F^{t} \in \mathbb{R}^{L \times d}$ is the local text features, $f^{t}_{sos}$ denotes the start-token feature, $\mathbb{R}$ denotes the set of real numbers, and $d$ is the feature dimension.

[0010] Furthermore, the global and local patch features of the image are extracted with the pre-trained CLIP image encoder as follows: first, the visible-light image $I$ is divided into a sequence of non-overlapping patches of fixed size $P \times P$, giving a sequence length of $N = HW / P^{2}$. The patches are then mapped to a sequence of one-dimensional vectors through a linear layer. Subsequently, positional encodings and a class token [CLS] are added to the sequence, and the processed sequence is input into the CLIP image encoder to obtain the global and local features of the image, where $f^{v}_{g} \in \mathbb{R}^{d}$ is the global image feature, $F^{v} \in \mathbb{R}^{N \times d}$ is the local image features, and $d$ is the feature dimension.

[0011] Furthermore, in step 3), the identity loss function is calculated as follows:

[0012] $$\mathcal{L}_{id} = -\frac{1}{B} \sum_{i=1}^{B} \left[ \log \mathrm{softmax}\left(W f^{t}_{g,i}\right)_{y_i} + \log \mathrm{softmax}\left(W f^{v}_{g,i}\right)_{y_i} \right]$$
where $\mathcal{L}_{id}$ is the identity loss, $f^{t}_{g,i}$ is the global text feature of the $i$-th sample, $f^{v}_{g,i}$ is the global image feature of the $i$-th sample, $y_i$ is the identity label of the $i$-th sample, $W$ is a linear classification layer, softmax is a nonlinear activation function, and $B$ is the size of a training batch.

[0013] Furthermore, step 4) specifically includes:
41) Calculate the text distribution weight of each local text feature and the image distribution weight of each local image feature;
42) Calculate the semantic distance between text tokens and image patches, combine the text and image distribution weights to transform the alignment of local text and image features into solving for the text→image transport plan, and obtain the optimal text→image transport plan;
43) Calculate the semantic distance between image patches and text tokens, combine the text and image distribution weights to transform the alignment of local image and text features into solving for the image→text transport plan, and obtain the optimal image→text transport plan;
44) Calculate the optimal bidirectional transport distance from the two optimal transport plans above.

[0014] Furthermore, in step 42), the problem of aligning local text and image features is transformed into solving for the text→image transport plan under the following constraints:

[0015] $$\hat{\pi}^{t \to v} = \arg\min_{\pi \in \Pi(\mu, \nu)} \; \langle \pi, C \rangle - \varepsilon H(\pi)$$
In the formula, $C$ is the semantic distance between text tokens and image patches, $\pi$ is the text→image transport plan, $\mu$ is the text distribution weights, $\nu$ is the image distribution weights, $\Pi(\mu, \nu)$ is the set of all transport plans that satisfy the constraints, $\langle \pi, C \rangle = \sum_{k,m} \pi_{km} C_{km}$ is the total transport cost, $H(\pi) = -\sum_{k,m} \pi_{km} \log \pi_{km}$ is the entropy regularization term, and $\varepsilon$ is the entropy regularization coefficient, which controls the smoothness of the transport plan. In the text→image transport plan, each row sums to the corresponding text distribution weight, i.e. $\sum_{m} \pi_{km} = \mu_k$; each column sums to the corresponding image distribution weight, i.e. $\sum_{k} \pi_{km} = \nu_m$; and all elements are non-negative, $\pi_{km} \ge 0$.

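[0016] Furthermore, in step 42), the Sinkhorn algorithm is used to solve the constrained problem, as follows: compute the kernel matrix $K = \exp(-C / \varepsilon)$, initialize the scaling vectors $\mathbf{u}^{(0)} = \mathbf{1}_{L}$ and $\mathbf{v}^{(0)} = \mathbf{1}_{N}$, and iterate the updates (divisions are element-wise)

[0017] $$\mathbf{u}^{(n+1)} = \frac{\mu}{K \mathbf{v}^{(n)}}, \qquad \mathbf{v}^{(n+1)} = \frac{\nu}{K^{\top} \mathbf{u}^{(n+1)}}$$

[0018] Stop when $\lVert \mathbf{u}^{(n+1)} - \mathbf{u}^{(n)} \rVert < \delta$ and $\lVert \mathbf{v}^{(n+1)} - \mathbf{v}^{(n)} \rVert < \delta$, where $\delta$ is the tolerance (also known as the convergence threshold), and compute the optimal transport plan: $\hat{\pi}^{t \to v} = \mathrm{diag}(\mathbf{u}) \, K \, \mathrm{diag}(\mathbf{v})$.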

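[0019] Furthermore, in step 5), the mathematical consistency of the transport plans in the two directions is ensured through a bidirectional consistency constraint, as follows:

[0020] $$\mathcal{L}_{cons} = \left\lVert \hat{\pi}^{t \to v} - \left(\hat{\pi}^{v \to t}\right)^{\top} \right\rVert_{F}$$
where $\mathcal{L}_{cons}$ is the consistency loss and $\lVert \cdot \rVert_{F}$ is the Frobenius norm; ideally $\hat{\pi}^{t \to v} = (\hat{\pi}^{v \to t})^{\top}$. For text-image pairs, the bidirectional optimal-transport contrastive loss is calculated: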

[0021]

[0022] where $\tau$ is the temperature coefficient and $B$ is the training batch size; the remaining symbols denote the optimal bidirectional transport distance, the similarity of the text and image features, and the bidirectional optimal-transport contrastive loss. The total loss combines the consistency loss, the bidirectional optimal-transport contrastive loss, and the identity loss $\mathcal{L}_{id}$.

[0023] Furthermore, in step 6), features of the query text and the gallery images are extracted with the CLIP text encoder and image encoder, respectively, and the global similarity between the query text and each gallery image is calculated:

[0024] In the formula, the global text feature of the query is compared with the global image feature of the $k$-th gallery sample. The gallery images are sorted by similarity in descending order, and the top ten best-matching images are output to complete the retrieval task.

[0025] This invention employs a bidirectional optimal transport alignment mechanism, achieving fine-grained cross-modal semantic alignment through optimal transport of local features in both the text-to-image and image-to-text directions. It also introduces an adaptive distribution weight calculation and a bidirectional consistency constraint to keep the transport process well-posed and stable. Furthermore, it combines the global identity classification loss with the local optimal-transport contrastive loss to construct a multi-level feature alignment framework, enabling the model to learn highly discriminative identity features and robust cross-modal matching at the same time. This invention significantly improves the accuracy and generalization performance of text-based person retrieval.

Attached Figure Description

[0026] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0027] Figure 1 is a flowchart of the text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to the present invention; Figure 2 is a schematic diagram of the architecture of the same method.

Detailed Implementation

[0028] The technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0029] As shown in Figure 1 and Figure 2, this invention is a text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment, with the following steps: Step 1: Multimodal data input and preprocessing. The text description and the visible-light image are used as input; the text is segmented and encoded, and the image is normalized and standardized.

[0030] First, the input for each training sample is $(T, I, y)$, where $T$ is the text description, $I \in \mathbb{R}^{H \times W \times C}$ is the input image ($H$, $W$, and $C$ are the height, width, and number of channels, respectively), and $y$ is the person identity label.

[0031] Then, text preprocessing includes word segmentation, padding, and encoding to convert the text into a token sequence; image preprocessing includes resizing, normalization that scales pixel values from [0, 255] to [-1, 1], and standardization using the ImageNet mean and standard deviation.

[0032] Step 2, Deep Feature Extraction: Use the pre-trained CLIP text encoder to extract global and local token features of the text, and use the pre-trained CLIP image encoder to extract global and local patch features of the image.

[0033] First, text features are extracted with the CLIP text encoder. A start token [SOS] and an end token [EOS] are added to the text token sequence of length $L$, and the resulting sequence is input into the text encoder, which finally yields the global text feature $f^{t}_{g} \in \mathbb{R}^{d}$ and the local text features $F^{t} \in \mathbb{R}^{L \times d}$, where $d$ is the feature dimension.

[0034] Secondly, image features are extracted with the CLIP image encoder. First, the input image $I$ is divided into a sequence of non-overlapping patches of fixed size $P \times P$, giving a sequence length of $N = HW / P^{2}$. The patches are then mapped to a sequence of one-dimensional vectors through a linear layer. Then, positional encodings and a class token [CLS] are added to the sequence, and the processed sequence is input into the image encoder. Finally, $f^{v}_{g} \in \mathbb{R}^{d}$ is obtained as the global image feature and $F^{v} \in \mathbb{R}^{N \times d}$ as the local image features, where $d$ is the feature dimension.
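
The patch-splitting step can be illustrated with a minimal NumPy sketch. The function name `patchify` and the 384×128 input size are illustrative assumptions, not taken from the patent; the linear projection, positional encoding, and [CLS] token that CLIP applies afterwards are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into non-overlapping P x P patches, one flat row per patch."""
    h, w, c = image.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("H and W must be divisible by the patch size")
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C) with N = H*W / P^2
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

image = np.zeros((384, 128, 3), dtype=np.float32)  # illustrative size only
tokens = patchify(image, 16)
print(tokens.shape)  # (192, 768): N = 384 * 128 / 16^2 = 192 patches
```

Each row would then be passed through the linear layer mentioned above to obtain one vector per patch.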

[0035] Step 3: Global feature alignment. The model is constrained by the identity classification loss function to learn highly discriminative identity features.

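[0036] The identity loss $\mathcal{L}_{id}$ is calculated as follows: $$\mathcal{L}_{id} = -\frac{1}{B} \sum_{i=1}^{B} \left[ \log \mathrm{softmax}\left(W f^{t}_{g,i}\right)_{y_i} + \log \mathrm{softmax}\left(W f^{v}_{g,i}\right)_{y_i} \right]$$

[0037] where $f^{t}_{g,i}$ is the global text feature of the $i$-th sample, $f^{v}_{g,i}$ is the global image feature of the $i$-th sample, $y_i$ is the identity label of the $i$-th sample, $W$ is a linear classification layer, softmax is a nonlinear activation function, and $B$ is the size of a training batch.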
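
As a rough illustration of the identity loss, the sketch below applies one shared linear classifier to both global features and sums the two softmax cross-entropy terms. The exact normalization in the patent's formula is not recoverable from this text, so averaging each term over the batch is an assumption, as are the function names and the bias term.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def identity_loss(text_global, image_global, labels, weight, bias):
    """Cross-entropy of a shared linear classifier on both global features (assumed form)."""
    idx = np.arange(len(labels))
    loss_text = -log_softmax(text_global @ weight + bias)[idx, labels].mean()
    loss_image = -log_softmax(image_global @ weight + bias)[idx, labels].mean()
    return loss_text + loss_image

# Toy batch: B = 4 samples, d = 8 features, 5 identities (all values hypothetical).
rng = np.random.default_rng(0)
loss = identity_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)),
                     np.array([0, 1, 2, 3]), rng.normal(size=(8, 5)), np.zeros(5))
print(loss)
```

Sharing the classifier across modalities pushes text and image features of the same identity toward the same class region, which is the stated purpose of global alignment.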

[0038] Step 4: Bidirectional optimal transport alignment. Fine-grained cross-modal semantic alignment is achieved through optimal transport of local features in both the text-to-image and image-to-text directions, with adaptive distribution weights.

[0039] First, the adaptive distribution weights are calculated. The text distribution weight is computed for each local text feature:

[0040] where $f^{t}_{g}$ is the global text feature and $\cos(\cdot, \cdot)$ is cosine similarity.

[0041] The image distribution weight is computed for each local image feature:
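
The weight formulas themselves are missing from this text. One plausible reading, shown here only as a sketch, scores each local feature by its cosine similarity to the global feature of the same modality and normalizes the scores with a softmax so that they form a probability distribution. The softmax normalization and the name `adaptive_weights` are assumptions.

```python
import numpy as np

def adaptive_weights(local_feats, global_feat):
    """Distribution over local features from cosine similarity to the global feature.

    The softmax normalization is an assumption; the patent's exact formula is not shown.
    """
    local_n = local_feats / np.linalg.norm(local_feats, axis=1, keepdims=True)
    global_n = global_feat / np.linalg.norm(global_feat)
    sim = local_n @ global_n              # cosine similarity, one value per local feature
    e = np.exp(sim - sim.max())           # stable softmax
    return e / e.sum()

weights = adaptive_weights(np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]),
                           np.array([1.0, 0.0]))
print(weights, weights.sum())
```

The same function can be applied to the text tokens with the global text feature and to the image patches with the global image feature, giving the two marginals used by the transport problem.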

[0042] Subsequently, for text→image optimal transport, the semantic distance between text tokens and image patches is calculated: $$C_{km} = 1 - \cos\left(f^{t}_{k}, f^{v}_{m}\right)$$

[0043] where $C \in \mathbb{R}^{L \times N}$ is the cost matrix and each element $C_{km}$ is the semantic distance between the $k$-th text token and the $m$-th image patch. The distance lies in [0, 2], where 0 means the two features point in the same direction and 2 means they are exactly opposite.
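
A cost with the stated range [0, 2] is obtained from one minus the cosine similarity. The sketch below builds such a matrix; the function name `cosine_cost` is illustrative.

```python
import numpy as np

def cosine_cost(text_local, image_local):
    """Cost matrix of shape (L, N): 1 - cosine similarity between tokens and patches."""
    t = text_local / np.linalg.norm(text_local, axis=1, keepdims=True)
    v = image_local / np.linalg.norm(image_local, axis=1, keepdims=True)
    return 1.0 - t @ v.T

cost = cosine_cost(np.array([[1.0, 0.0]]),
                   np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]))
print(cost)  # same direction -> 0, orthogonal -> 1, opposite -> 2
```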

[0044] The problem of aligning local text and image features is transformed into solving for a transport plan $\pi^{t \to v}$ such that $$\hat{\pi}^{t \to v} = \arg\min_{\pi \in \Pi(\mu, \nu)} \; \langle \pi, C \rangle - \varepsilon H(\pi)$$

[0045] The constraints are that each row of the transport plan sums to the corresponding text distribution weight, $\sum_{m} \pi_{km} = \mu_k$; each column sums to the corresponding image distribution weight, $\sum_{k} \pi_{km} = \nu_m$; and all elements are non-negative, $\pi_{km} \ge 0$. This ensures that semantic mass is conserved during transport. $\Pi(\mu, \nu)$ is the set of all transport plans that satisfy the above constraints, $\langle \pi, C \rangle = \sum_{k,m} \pi_{km} C_{km}$ is the total transport cost, $H(\pi) = -\sum_{k,m} \pi_{km} \log \pi_{km}$ is the entropy regularization term, and the entropy regularization coefficient $\varepsilon$ controls the smoothness of the transport plan.

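[0046] The problem is solved with the Sinkhorn algorithm: compute the kernel matrix $K = \exp(-C / \varepsilon)$, initialize the scaling vectors $\mathbf{u}^{(0)} = \mathbf{1}_{L}$ and $\mathbf{v}^{(0)} = \mathbf{1}_{N}$, and iterate the updates (divisions are element-wise):

[0047] $$\mathbf{u}^{(n+1)} = \frac{\mu}{K \mathbf{v}^{(n)}}, \qquad \mathbf{v}^{(n+1)} = \frac{\nu}{K^{\top} \mathbf{u}^{(n+1)}}$$

[0048] Stop when $\lVert \mathbf{u}^{(n+1)} - \mathbf{u}^{(n)} \rVert < \delta$ and $\lVert \mathbf{v}^{(n+1)} - \mathbf{v}^{(n)} \rVert < \delta$, where $\delta$ is the tolerance (also known as the convergence threshold), and compute the optimal transport plan: $\hat{\pi}^{t \to v} = \mathrm{diag}(\mathbf{u}) \, K \, \mathrm{diag}(\mathbf{v})$.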
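
The Sinkhorn iteration described above can be sketched in a few lines of NumPy. This is a generic entropy-regularized Sinkhorn solver, not the patent's own code: the max-absolute-change stopping test and the `max_iter` cap are assumptions, while the defaults `eps=0.1` and `tol=0.01` follow the hyperparameters reported in the experimental setup.

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.1, tol=0.01, max_iter=1000):
    """Entropy-regularized transport plan with row marginal mu and column marginal nu."""
    kernel = np.exp(-cost / eps)            # K = exp(-C / eps)
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(max_iter):
        u_new = mu / (kernel @ v)
        v_new = nu / (kernel.T @ u_new)
        # Stop once both scaling vectors change by less than the tolerance.
        converged = (np.abs(u_new - u).max() < tol) and (np.abs(v_new - v).max() < tol)
        u, v = u_new, v_new
        if converged:
            break
    return u[:, None] * kernel * v[None, :]  # diag(u) K diag(v)

rng = np.random.default_rng(0)
cost = rng.random((4, 6))                    # toy cost matrix, L = 4 tokens, N = 6 patches
plan = sinkhorn(cost, np.full(4, 0.25), np.full(6, 1 / 6))
print(plan.sum(axis=1), plan.sum(axis=0))
```

The image→text direction can reuse the same function with the transposed cost matrix and the two marginals swapped.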

[0049] Similarly, for image→text optimal transport, the semantic distance between image patches and text tokens is calculated:

[0050] The problem of aligning local image and text features is transformed into solving for a transport plan $\pi^{v \to t}$ such that

[0051] As with text→image optimal transport, the Sinkhorn algorithm is used to solve the problem; see Table 1.

[0052] Table 1. Procedure of the Sinkhorn algorithm for solving the optimal transport plan

[0053] Finally, the optimal bidirectional transport distance is calculated:

[0054]

[0055]
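
The distance formulas are not reproduced in this text. As a hedged sketch, each directional distance can be taken as the transport cost of its plan, and the bidirectional distance as their average; the averaging and the function name are assumptions.

```python
import numpy as np

def bidirectional_ot_distance(plan_t2v, plan_v2t, cost):
    """Average of the two directional transport costs (the averaging is an assumption).

    plan_t2v has shape (L, N), plan_v2t has shape (N, L), and cost has shape (L, N).
    """
    dist_t2v = float(np.sum(plan_t2v * cost))      # <pi_t2v, C>
    dist_v2t = float(np.sum(plan_v2t * cost.T))    # <pi_v2t, C^T>
    return 0.5 * (dist_t2v + dist_v2t)

cost = np.array([[0.0, 2.0], [2.0, 0.0]])
matched = np.diag([0.5, 0.5])                      # mass moved only along zero-cost pairs
print(bidirectional_ot_distance(matched, matched, cost))  # 0.0
```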

[0056] Step 5: Joint optimization and loss calculation. The entire model is jointly optimized through the bidirectional consistency constraint, the identity loss, and the bidirectional optimal-transport contrastive loss.

[0057] First, the bidirectional consistency constraint ensures the mathematical consistency of the transport plans in the two directions: $$\mathcal{L}_{cons} = \left\lVert \hat{\pi}^{t \to v} - \left(\hat{\pi}^{v \to t}\right)^{\top} \right\rVert_{F}$$

[0058] where $\lVert \cdot \rVert_{F}$ is the Frobenius norm; ideally $\hat{\pi}^{t \to v} = (\hat{\pi}^{v \to t})^{\top}$. This constraint ensures the symmetry of bidirectional transport.
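
The consistency constraint can be sketched as the Frobenius norm of the difference between the text→image plan and the transpose of the image→text plan. Whether the patent squares the norm is not recoverable from this text, so the unsquared form used here is an assumption.

```python
import numpy as np

def consistency_loss(plan_t2v, plan_v2t):
    """Frobenius norm of pi_t2v - pi_v2t^T; zero when the two plans agree."""
    return float(np.linalg.norm(plan_t2v - plan_v2t.T, ord="fro"))

plan_t2v = np.array([[0.3, 0.2], [0.1, 0.4]])
print(consistency_loss(plan_t2v, plan_t2v.T))  # 0.0 for perfectly consistent plans
```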

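[0059] Secondly, for text-image pairs, the bidirectional optimal-transport contrastive loss is calculated:

[0060]

[0061] where $\tau$ is the temperature coefficient and $B$ is the training batch size; the remaining symbols denote the optimal bidirectional transport distance, the similarity of the text and image features, and the bidirectional optimal-transport contrastive loss. The total loss combines the consistency loss, the bidirectional optimal-transport contrastive loss, and the identity loss.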
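
Because the contrastive formula is missing here, the following is only one plausible instantiation: a symmetric InfoNCE-style loss over a batch in which the score of a pair is its feature similarity minus its bidirectional transport distance, scaled by the temperature. The score definition `sim - dist`, the symmetric averaging, and the function names are all assumptions; `tau=0.05` follows the reported hyperparameter.

```python
import numpy as np

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def bot_contrastive_loss(sim, dist, tau=0.05):
    """Symmetric contrastive loss over a B x B batch (assumed score: sim - dist)."""
    logits = (sim - dist) / tau
    diag = np.arange(logits.shape[0])        # matched text-image pairs lie on the diagonal
    loss_t2v = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_v2t = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_t2v + loss_v2t)

batch = 4
print(bot_contrastive_loss(np.eye(batch), 1.0 - np.eye(batch)))
```

Under this reading, a matched pair is rewarded both for high global similarity and for a low transport distance between its local features.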

[0062] Step 6: In the inference stage, features of the query text and the gallery images are extracted and their similarity is calculated; the best-matching persons are then ranked and output to complete the retrieval task.

[0063] During the inference phase, features of the query text and the gallery images are extracted with the CLIP text encoder and image encoder, respectively, and the global similarity between the query text and each gallery image is calculated.

[0064] The gallery images are sorted by similarity in descending order, and the top ten best-matching images are output to complete the retrieval task.
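
The ranking step can be sketched as follows, assuming cosine similarity between the global features (the similarity formula itself is not reproduced in this text) and a hypothetical helper name `retrieve_top_k`.

```python
import numpy as np

def retrieve_top_k(query_feat, gallery_feats, k=10):
    """Return indices and scores of the k gallery images most similar to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                              # cosine similarity to every gallery image
    order = np.argsort(-sims)[:k]             # descending similarity
    return order, sims[order]

gallery = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
indices, scores = retrieve_top_k(np.array([1.0, 0.0]), gallery, k=3)
print(indices)  # [1 2 0]
```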

[0065] This invention proposes a text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment, which has been validated on the text-based person retrieval task. The experiments use public datasets commonly used for this task: CUHK-PEDES, RSTPReid, and ICFG-PEDES. These datasets contain a large number of real-world person images and their corresponding text descriptions.

[0066] The experimental setup is as follows:
1. Feature extraction backbone: a pre-trained CLIP ViT-B/16 model is used as both the text encoder and the image encoder, with a feature dimension of 512.
2. Training configuration: batch size $B$ = 64, Adam optimizer, initial learning rate 0.0001, and L2-regularization weight decay of 1e-6. Training runs for 60 epochs in total; the learning rate is increased linearly during the first 10 epochs for warm-up and then decreased to one-tenth at the 30th and 50th epochs.
3. Key hyperparameters: the entropy regularization coefficient $\varepsilon$ in bidirectional optimal transport is set to 0.1, the Sinkhorn tolerance is 0.01, and the temperature coefficient $\tau$ in the loss function is 0.05.
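
The learning-rate schedule in the training configuration can be written as a small function. The sketch assumes 0-indexed epochs, a warm-up that ramps from one-tenth of the base rate up to the base rate, and a tenfold reduction at each of the two milestones; the warm-up start value and the compounding of the two reductions are assumptions.

```python
def learning_rate(epoch, base_lr=1e-4, warmup_epochs=10, milestones=(30, 50), gamma=0.1):
    """Learning rate for a 0-indexed epoch: linear warm-up, then step decay."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    num_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** num_decays

for epoch in (0, 9, 29, 30, 50, 59):
    print(epoch, learning_rate(epoch))
```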

[0067] This invention uses mean average precision (mAP) and Rank-1, Rank-5, and Rank-10 accuracy as the core evaluation metrics. The method was compared with several mainstream methods on the CUHK-PEDES dataset, and the results are shown in the table below:
Table 2: Performance evaluation (%) on the CUHK-PEDES dataset

[0068] Experimental results show that the proposed method achieves the best performance on all key metrics, significantly outperforming the compared methods. This demonstrates that fine-grained cross-modal alignment through bidirectional optimal transport can effectively bridge the semantic gap and improve the model's ability to retrieve person images from text descriptions in complex scenarios, with good accuracy and generalization.
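
For reference, the Rank-k and average-precision metrics named in the evaluation can be computed per query as sketched below; mAP is the mean of the per-query average precision. The function name and the toy data are illustrative and do not reproduce any result from the patent.

```python
import numpy as np

def rank_k_and_ap(sims, gallery_ids, query_id, ks=(1, 5, 10)):
    """Rank-k hits and average precision for one query against a labelled gallery."""
    order = np.argsort(-sims)                         # descending similarity
    hits = gallery_ids[order] == query_id
    rank_k = {k: bool(hits[:k].any()) for k in ks}
    if not hits.any():
        return rank_k, 0.0
    positions = np.flatnonzero(hits) + 1              # 1-based ranks of correct images
    precision_at_hits = np.arange(1, len(positions) + 1) / positions
    return rank_k, float(precision_at_hits.mean())

rank_k, ap = rank_k_and_ap(np.array([0.9, 0.8, 0.7, 0.6]),
                           np.array([2, 1, 1, 3]), query_id=1)
print(rank_k, ap)
```

Averaging the Rank-k indicators and the AP values over all queries gives the Rank-k accuracy and mAP figures that such a results table reports.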

[0069] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

[0070] This specification and accompanying drawings are merely illustrative examples of the present invention and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present invention. Clearly, those skilled in the art can make various alterations and modifications to the present invention without departing from its scope. Therefore, if such modifications and variations fall within the scope of the present invention and its equivalents, the present invention intends to include these modifications and variations.

Claims

1. A text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment, characterized in that it includes the following steps:
1) Use text descriptions and visible-light images as input to the model, and preprocess the text and the visible-light images separately;
2) Use the pre-trained CLIP text encoder to extract global features and local token features of the text, and use the pre-trained CLIP image encoder to extract global features and local patch features of the image;
3) Global feature alignment: the model is constrained by the identity loss function to learn highly discriminative identity features;
4) Fine-grained cross-modal semantic alignment is achieved through optimal transport of local features in both the text-to-image and image-to-text directions, with adaptive distribution weights;
5) Joint optimization and loss calculation: the entire model is jointly optimized through the bidirectional consistency constraint, the identity loss, and the bidirectional optimal-transport contrastive loss;
6) In the model inference stage, features of the query text and the gallery images are extracted and their similarity is calculated; the best-matching persons are then output by ranking to complete the retrieval task.

2. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 1, characterized in that step 1) includes the following:
Step 11) The input for each training sample is $(T, I, y)$, where $T$ is the text description, $I \in \mathbb{R}^{H \times W \times C}$ is the visible-light image ($H$, $W$, and $C$ are the image height, width, and number of channels, respectively), and $y$ is the person identity label;
Step 12) Text preprocessing includes word segmentation, padding, and encoding to convert the text into a token sequence; visible-light image preprocessing includes resizing, normalization that scales pixel values from [0, 255] to [-1, 1], and standardization using the ImageNet mean and standard deviation.

3. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 1, characterized in that the global and local token features of the text are extracted with the pre-trained CLIP text encoder as follows: a start token [SOS] and an end token [EOS] are added to the text token sequence of length $L$, and the resulting sequence is input into the CLIP text encoder to extract the global and local token features, where $f^{t}_{g} \in \mathbb{R}^{d}$ is the global text feature, $F^{t} \in \mathbb{R}^{L \times d}$ is the local text features, $f^{t}_{sos}$ denotes the start-token feature, $\mathbb{R}$ denotes the set of real numbers, and $d$ is the feature dimension.

4. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 1, characterized in that the global and local patch features of the image are extracted with the pre-trained CLIP image encoder as follows: first, the visible-light image $I$ is divided into a sequence of non-overlapping patches of fixed size $P \times P$, giving a sequence length of $N = HW / P^{2}$; the patches are then mapped to a sequence of one-dimensional vectors through a linear layer; subsequently, positional encodings and a class token [CLS] are added to the sequence, and the processed sequence is input into the CLIP image encoder to obtain the global and local features of the image, where $f^{v}_{g} \in \mathbb{R}^{d}$ is the global image feature, $F^{v} \in \mathbb{R}^{N \times d}$ is the local image features, and $d$ is the feature dimension.

5. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 1, characterized in that in step 3), the identity loss function is calculated as follows: $$\mathcal{L}_{id} = -\frac{1}{B} \sum_{i=1}^{B} \left[ \log \mathrm{softmax}\left(W f^{t}_{g,i}\right)_{y_i} + \log \mathrm{softmax}\left(W f^{v}_{g,i}\right)_{y_i} \right]$$ where $\mathcal{L}_{id}$ is the identity loss, $f^{t}_{g,i}$ is the global text feature of the $i$-th sample, $f^{v}_{g,i}$ is the global image feature of the $i$-th sample, $y_i$ is the identity label of the $i$-th sample, $W$ is a linear classification layer, softmax is a nonlinear activation function, and $B$ is the size of a training batch.

6. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 1, characterized in that step 4) includes the following:
41) Calculate the text distribution weight of each local text feature and the image distribution weight of each local image feature;
42) Calculate the semantic distance between text tokens and image patches, combine the text and image distribution weights to transform the alignment of local text and image features into solving for the text→image transport plan, and obtain the optimal text→image transport plan;
43) Calculate the semantic distance between image patches and text tokens, combine the text and image distribution weights to transform the alignment of local image and text features into solving for the image→text transport plan, and obtain the optimal image→text transport plan;
44) Calculate the optimal bidirectional transport distance from the two optimal transport plans above.

7. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 6, characterized in that in step 42), the problem of aligning local text and image features is transformed into solving for the text→image transport plan under the following constraints: $$\hat{\pi}^{t \to v} = \arg\min_{\pi \in \Pi(\mu, \nu)} \; \langle \pi, C \rangle - \varepsilon H(\pi)$$ In the formula, $C$ is the semantic distance between text tokens and image patches, $\pi$ is the text→image transport plan, $\mu$ is the text distribution weights, $\nu$ is the image distribution weights, $\Pi(\mu, \nu)$ is the set of all transport plans that satisfy the constraints, $\langle \pi, C \rangle = \sum_{k,m} \pi_{km} C_{km}$ is the total transport cost, $H(\pi) = -\sum_{k,m} \pi_{km} \log \pi_{km}$ is the entropy regularization term, and $\varepsilon$ is the entropy regularization coefficient, which controls the smoothness of the transport plan; in the text→image transport plan, each row sums to the corresponding text distribution weight, i.e. $\sum_{m} \pi_{km} = \mu_k$, each column sums to the corresponding image distribution weight, i.e. $\sum_{k} \pi_{km} = \nu_m$, and all elements are non-negative, $\pi_{km} \ge 0$.

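8. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 7, characterized in that in step 42), the Sinkhorn algorithm is used to solve the constrained problem, with the following steps: compute the kernel matrix $K = \exp(-C / \varepsilon)$; initialize the scaling vectors $\mathbf{u}^{(0)} = \mathbf{1}_{L}$ and $\mathbf{v}^{(0)} = \mathbf{1}_{N}$; iterate the element-wise updates $\mathbf{u}^{(n+1)} = \mu / (K \mathbf{v}^{(n)})$ and $\mathbf{v}^{(n+1)} = \nu / (K^{\top} \mathbf{u}^{(n+1)})$; stop when $\lVert \mathbf{u}^{(n+1)} - \mathbf{u}^{(n)} \rVert < \delta$ and $\lVert \mathbf{v}^{(n+1)} - \mathbf{v}^{(n)} \rVert < \delta$, where $\delta$ is the tolerance, and compute the optimal transport plan $\hat{\pi}^{t \to v} = \mathrm{diag}(\mathbf{u}) \, K \, \mathrm{diag}(\mathbf{v})$.

9. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 8, characterized in that in step 5), the mathematical consistency of the transport plans in the two directions is ensured through a bidirectional consistency constraint, as follows: $$\mathcal{L}_{cons} = \left\lVert \hat{\pi}^{t \to v} - \left(\hat{\pi}^{v \to t}\right)^{\top} \right\rVert_{F}$$ where $\mathcal{L}_{cons}$ is the consistency loss and $\lVert \cdot \rVert_{F}$ is the Frobenius norm; ideally $\hat{\pi}^{t \to v} = (\hat{\pi}^{v \to t})^{\top}$. For text-image pairs, the bidirectional optimal-transport contrastive loss is calculated, where $\tau$ is the temperature coefficient and $B$ is the training batch size; the remaining symbols denote the optimal bidirectional transport distance, the similarity of the text and image features, and the bidirectional optimal-transport contrastive loss. The total loss combines the consistency loss, the bidirectional optimal-transport contrastive loss, and the identity loss $\mathcal{L}_{id}$.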

10. The text-based person retrieval method based on bidirectional optimal transport and multimodal feature alignment according to claim 1, characterized in that in step 6), features of the query text and the gallery images are extracted with the CLIP text encoder and image encoder, respectively, and the global similarity between the global text feature of the query and the global image feature of the $k$-th gallery sample is calculated; the gallery images are sorted by similarity in descending order, and the top ten best-matching images are output to complete the retrieval task.