A weakly supervised referring expression comprehension method and device based on spatial relationship enhancement
By constructing an explicit spatial transformation matrix that adjusts spatial score differences, the method addresses the difficulty WREC models have in distinguishing neighboring objects and aligning spatial relationships, and achieves high-precision weakly supervised referring expression comprehension.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing weakly supervised referential expression understanding (WREC) models struggle to distinguish spatial differences between neighboring objects and fail to establish accurate spatial relationship text-image alignment, leading to localization failures in scenarios such as object stacking.
An explicit spatial transformation matrix is constructed by parsing the spatial relationships in the referential expression. A nonlinear transformation function is used to adjust the spatial score difference. Combined with text features and selected target fusion sub-features, accurate marking of the target location is achieved.
It significantly enhances the spatial differentiation of nearby objects, improves the accuracy of target localization, and reduces the annotation cost in practical applications.
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a weakly supervised referring expression comprehension method and apparatus based on spatial relationship enhancement.
Background Technology
[0002] Referring Expression Comprehension (REC) is a core task at the intersection of computer vision and natural language processing. Its goal is to accurately locate the target object in an image based on a given natural language description, thereby achieving an effective association between linguistic and visual information.
[0003] Currently, reference expression understanding (REC) demonstrates significant value in practical applications such as human-computer interaction, intelligent navigation, and human-computer collaboration. However, existing supervised REC methods heavily rely on manually annotated bounding boxes. In a typical example, each language description is usually accompanied by a precisely corresponding target region annotation. This type of fully supervised annotation is not only costly and time-consuming but also often difficult to obtain in real-world scenarios, limiting the scalability and practical application potential of REC technology. To alleviate the reliance on costly manual annotation, research on weakly supervised reference expression understanding has rapidly emerged in recent years, showing broad application prospects. The training data for weakly supervised reference expression understanding only includes image and text pairs.
[0004] Spatial relationships are an indispensable clue in referential expressions. In everyday scenarios, people often identify their targets through spatial relationships. However, existing weakly supervised referring expression comprehension methods do not adequately understand the spatial relationships in referential expressions.
Summary of the Invention
[0005] In view of this, embodiments of the present invention provide a weakly supervised referring expression comprehension method based on spatial relationship enhancement, in order to eliminate or improve one or more defects existing in the prior art.
[0006] One aspect of the present invention provides a weakly supervised referring expression comprehension method based on spatial relationship enhancement, the method comprising the following steps:
obtaining the referential text, extracting triples from the referential text using a large language model, the triples including a spatial relation, and constructing a spatial transformation matrix based on the spatial relation;
obtaining the input image, encoding the input image using a visual encoder to obtain the main visual features, calculating the dynamic weight of each auxiliary encoder based on the main visual features, selecting an auxiliary encoder based on the dynamic weights, calculating auxiliary features using the selected auxiliary encoder, and calculating the fusion features based on the main visual features and the auxiliary features;
filtering the fusion sub-features in the fusion features based on the spatial transformation matrix to obtain the cropping features;
encoding the referential text using a text encoder to obtain text features, and filtering the fusion sub-features in the cropping features based on the text features to obtain the target fusion sub-features;
inputting the target fusion sub-features into a preset prediction head, which outputs a bounding box, and marking the input image based on the bounding box.
[0007] Both the prediction head and the detection head use the prediction head of the YOLOv3 target detector.
[0008] This approach addresses the challenges of existing weakly supervised referential expression understanding (WREC) models in distinguishing spatial differences between neighboring objects and accurately aligning text and images based on spatial relationships. Instead of relying on implicit spatial relationship semantic features, this approach constructs an explicit spatial transformation matrix by parsing spatial relationships (such as "top left" and "right side") within the referential expression. This matrix reshapes the initial spatial scores using a nonlinear transformation function (such as the sigmoid function). By adjusting the slope and center value parameters, it increases the score difference between regions that conform to the spatial relationship and their opposite regions, significantly enhancing the spatial discriminability of neighboring objects. Finally, it combines text features and selected target fusion sub-features to determine the bounding box location, completing the labeling process. This approach enables accurate labeling of target locations based on the spatial relationships within the referential expression.
[0009] In some embodiments of the present invention, the spatial relationship is used to express a directional positional relationship. In the step of constructing a spatial transformation matrix based on the spatial relationship, the size of the spatial transformation matrix is determined based on the image size of the input image, the initial position of the spatial transformation matrix is determined based on the spatial relationship, the matrix value of the matrix point at the initial position of the spatial transformation matrix is assigned to 0, and the Euclidean distance between the remaining matrix points and the initial position is calculated. The matrix value of the remaining matrix points is determined based on the calculated Euclidean distance to obtain the spatial transformation matrix.
[0010] In some embodiments of the present invention, in the step of determining the matrix values of the remaining matrix points based on the calculated Euclidean distance, the matrix value is calculated as 1 - the calculated Euclidean distance.
[0011] In some embodiments of the present invention, when the spatial relationship is "top left", the matrix value of each remaining matrix point is calculated using a formula whose terms denote the matrix value of the matrix point at the given coordinates, the height of the input image, and the width of the input image.
[0012] In some embodiments of the present invention, the step of constructing a spatial transformation matrix based on the spatial relationship further includes optimizing the spatial transformation matrix using an optimization function. The optimization function maps the value of the matrix point at each coordinate of the spatial transformation matrix before optimization to its optimized value, where α and β are preset calculation parameters.
[0013] In some embodiments of the present invention, in the step of calculating the dynamic weights of each auxiliary encoder based on the main visual features and selecting the auxiliary encoder based on the dynamic weights of the auxiliary encoders, the auxiliary encoder with the largest dynamic weight is selected, and the dynamic weights of the auxiliary encoders are calculated using a formula in which r represents the dynamic weight of the auxiliary encoder and the size of the input image corresponds to the size of the main visual features. The remaining terms denote the height of the input image, the width of the input image, the weight parameter and the bias parameter of the auxiliary encoder currently being calculated, the first sub-feature at a given location of the main visual features, and the total number of first sub-features of the main visual features.
[0014] In some embodiments of the present invention, in the step of calculating the fusion feature based on the main visual feature and the auxiliary feature, the fusion feature is calculated using a formula whose terms denote the fusion feature, the auxiliary feature, the main visual feature, and the dynamic weight parameters of the auxiliary feature and the main visual feature, respectively.
[0015] In some embodiments of the present invention, the matrix points of the spatial transformation matrix correspond one-to-one with the fusion sub-features of the fusion feature. In the step of filtering the fusion sub-features in the fusion feature based on the spatial transformation matrix to obtain the cropping features, a preset proportion of the fusion sub-features with the largest matrix values at their corresponding matrix points is retained, and the retained fusion sub-features are combined as the cropping features.
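The proportion-based filtering in this embodiment can be sketched as follows. This is a minimal illustration, assuming the matrix values have been flattened into one list with one entry per fusion sub-feature; the function name `keep_top_proportion` and the rounding rule are hypothetical and not taken from the patent.

```python
def keep_top_proportion(scores, proportion):
    """Return the indices of the sub-features with the largest matrix values.

    scores: one spatial matrix value per fusion sub-feature (flattened grid).
    proportion: preset fraction of sub-features to keep, in (0, 1].
    """
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    # Keep at least one sub-feature; the rounding rule is an assumption.
    keep = max(1, round(len(scores) * proportion))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in grid order.
    return sorted(ranked[:keep])
```

For example, with four sub-features scored 0.1, 0.9, 0.5 and 0.7 and a proportion of 0.5, the two highest-scoring sub-features (indices 1 and 3) are retained as the cropping features.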
[0016] In some embodiments of the present invention, the step of filtering the fusion sub-features in the fusion features based on the spatial transformation matrix to obtain the cropping features further includes: calculating the confidence level of each fusion sub-feature using the detection head; calculating a filtering value from the matrix value at the matrix point corresponding to the fusion sub-feature and the confidence level of the fusion sub-feature; filtering the fusion sub-features based on the filtering value; and combining the retained fusion sub-features as the cropping features.
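A possible form of this confidence-aware filtering is sketched below. The patent text does not state how the matrix value and the confidence are combined, so the product used here is purely an assumption, as are the function names and the fixed `keep` count.

```python
def filter_values(matrix_values, confidences):
    """Combine spatial matrix values with detection-head confidences.

    The element-wise product is an assumed combination rule.
    """
    return [m * c for m, c in zip(matrix_values, confidences)]


def select_by_filter_value(matrix_values, confidences, keep):
    """Return indices of the `keep` sub-features with the largest filter value."""
    values = filter_values(matrix_values, confidences)
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(ranked[:keep])
```

Under this assumed rule, a sub-feature in a well-placed region with low confidence (0.9 and 0.2) ranks below one that is both well placed and confident (0.8 and 0.9).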
[0017] In some embodiments of the present invention, in the step of filtering the fusion sub-features in the cropping features based on the text features to obtain the target fusion sub-features, the similarity of each fusion sub-feature is calculated using a formula whose terms denote the similarity of the fusion sub-feature and the feature value in any one dimension.
[0018] In some embodiments of the present invention, the method further includes model pre-training. In the model pre-training step, corresponding bounding boxes are obtained based on each piece of training data in a preset training data set. The InfoNCE contrastive loss, the cross-modal contrastive alignment loss, and the spatial contrastive loss are calculated based on the entire training data set. The sum of these three losses is taken as the total loss, and model training is performed based on the total loss.
[0019] A second aspect of the present invention also provides a weakly supervised referring expression comprehension device based on spatial relationship enhancement. The device includes a computer device, the computer device including a processor and a memory, the memory storing computer instructions, and the processor executing the computer instructions stored in the memory. When the computer instructions are executed by the processor, the device implements the steps of the method described above.
[0020] A third aspect of the present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the aforementioned weakly supervised referring expression comprehension method based on spatial relationship enhancement.
[0021] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the text, or may be learned by practice of the invention. The objects and other advantages of the invention will become apparent from the description and the accompanying drawings.
[0022] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.
Attached Figure Description
[0023] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.
[0024] Figure 1 is a schematic diagram of one implementation of the weakly supervised referring expression comprehension method based on spatial relationship enhancement in this scheme; Figure 2 is a comparison example between the localization results of the existing WREC method and the ground truth (GT); Figure 3 is a framework diagram of the existing technology; Figure 4 is a schematic diagram of the overall architecture of this solution; Figure 5 is a schematic diagram of the experimental results of this scheme.
Detailed Implementation
[0025] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.
[0026] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.
[0027] Figure 2 presents a comparison example between the localization results of the existing WREC method and the ground truth (GT) annotations. The red bounding box represents the model's predicted localization result, and the yellow bounding box represents the true target location. Figure 2 shows that existing methods fail to understand spatial relationships effectively. For example, in a scene of stacked donuts, the model may mislocalize "the donut with the sugar cube in the top left corner" as an adjacent donut; in a scene of dense vegetation, it cannot distinguish "the fern in the top right corner" from other plants of the same type in other areas. This demonstrates that these methods are insufficient in understanding spatial relationships within referential expressions. Yet spatial relationships are indispensable clues in referential expressions. In everyday scenarios, people often use spatial relationships to identify targets (such as "the cup on the right side of the table" or "the elephant in the middle of the picture"). Therefore, enabling models to accurately parse the spatial relationships in referential expressions and, based on these relationships, precisely localize target objects has become a core problem that urgently needs to be solved in weakly supervised referring expression comprehension.
[0028] Introduction to existing technologies: The framework of the existing technology is shown in Figure 3. The input referential expression is "A man on a brown horse," and text features are obtained using an LSTM model. The input image's primary visual features are obtained using a DarkNet model. Auxiliary visual features are obtained using other auxiliary visual feature extractors, such as the SAM-ViT, CLIP-CNN, and DINO-ViT models. The weights corresponding to these auxiliary visual features are then determined, and the auxiliary feature with the highest weight is selected and weighted together with the primary visual feature to obtain the final visual feature.
[0029] Next, the anchor-text matching and prediction stage begins. Specifically, a similarity score is calculated between the fused visual features and the text features, and a matching function is used to evaluate the degree of matching between each candidate anchor and the referential expression. The anchor with the highest similarity score is selected as the target referent, and the detection head then decodes this anchor into specific bounding box coordinates.
[0030] To optimize network parameters under weak supervision, this scheme constructs a composite loss function, which consists of the following four parts:
1. Anchor-text contrastive loss: As a basic weakly supervised learning signal, this loss aims to maximize the similarity between the target anchor and the corresponding text features, while minimizing the similarity with non-target text. It is calculated using a contrastive loss function in the form of InfoNCE.
[0031] 2. Diversity-aware routing loss: The entropy value is calculated based on the weight distribution generated by the router. This loss encourages a more even distribution of routing probabilities among the available auxiliary visual encoders, preventing the overuse of a single auxiliary encoder during training, thereby balancing expert load and improving the diversity of routes.
[0032] 3. Text-guided alignment loss: This is part of the routing feature alignment module. It uses a multilayer perceptron (MLP) to project fused visual features onto the same dimension as the text features, and calculates the mean squared error (MSE) between the projected visual and text features. This loss aims to enrich visual semantics with textual information, enhancing the model's understanding of fine-grained representations (such as attributes and colors).
[0033] 4. Visual-guided alignment loss: This is another part of the routing feature alignment module. Employing a contrastive learning mechanism, it projects the fused visual features and the primary visual features separately into a shared embedding space. This encourages the primary visual features to approximate the semantic representation of the fused features while maintaining discriminability with other samples. This loss distills the visual knowledge of the combined features back into the basic features, improving the representational power of the basic encoder.
[0034] The total loss function is defined as the weighted sum of the four losses mentioned above.
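As a rough illustration of the first loss term and of the weighted sum, the sketch below computes an InfoNCE-style loss for one anchor and combines several loss values. The temperature of 0.07, the function names, and the example weights are assumptions; the text does not give the temperature or the loss weights.

```python
import math


def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE-style loss for one anchor.

    similarities: similarity of the anchor to each candidate text feature.
    positive_index: position of the matching text feature.
    temperature: assumed value, not specified in the text.
    """
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the peak for numerical stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_denominator - scaled[positive_index]


def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(w * value for w, value in zip(weights, losses))
```

The loss is small when the positive pair has the highest similarity and grows when a non-target text scores higher, which is the behavior the anchor-text contrastive loss relies on.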
[0035] During training, an optimizer (such as Adam) is used to train the network through end-to-end backpropagation. During the inference phase, the visual conditional router dynamically generates weights based on the main visual features of the input image, activating only the DarkNet model and the auxiliary visual encoder with the highest weight for forward propagation.
[0036] Disadvantages of existing technology:
1. Existing models have difficulty distinguishing spatial differences between neighboring objects based on spatial relationships.
[0037] Most existing models do not establish explicit spatial relationship models, but instead directly perform global feature matching using different entity features in text and images. During feature matching, models tend to capture global semantic features, ignoring fine-grained semantic features such as space. While some models attempt to model spatial relationships, the generated spatial scores decay slowly, resulting in regions whose locations clearly violate spatial relationships still receiving high scores. Furthermore, neighboring objects often have similar spatial scores, reducing the discriminative power of objects based on spatial relationships and failing to effectively distinguish the spatial differences between neighboring objects, leading to localization failures in scenarios such as object stacking.
[0038] 2. Existing models struggle to establish accurate alignment of text and images corresponding to spatial relationships.
[0039] Because existing models mostly use batch-based global image-text comparison and alignment, it is difficult to distinguish between spatially consistent and spatially inconsistent objects at a fine-grained level within a single sample. Existing loss functions lack explicit constraints on spatial semantics and cannot suppress "misplaced" objects as negative samples. As a result, the model cannot fully utilize spatial constraints to filter out objects that do not match the representation, thus affecting the localization accuracy under weak supervision.
[0040] As shown in Figures 1 and 4, this invention proposes a weakly supervised referring expression comprehension method based on spatial relationship enhancement. The steps of the method include:
Step S100: Obtain the referential text, extract the triples of the referential text using a large language model, the triples including spatial relations, and construct a spatial transformation matrix based on the spatial relations. In the specific implementation process, this solution uses an existing Large Language Model (LLM) to parse structured spatial triples from the referential text. The spatial triples include the main entity, the spatial relation, and the reference entity. Based on the parsed spatial relation, an initial spatial matrix is constructed, and the final spatial transformation matrix is generated through a nonlinear transformation.
[0041] This solution employs a large language model to parse spatial triples, improving the accuracy of spatial semantic understanding. Existing techniques often rely on keyword extraction when processing spatial relationships. These methods only consider local semantics and may lead to errors in spatial understanding. For example, a text may contain "top-left," but only the key location "left" might be extracted. In contrast, this solution utilizes a large language model to parse spatial triples in referential expressions, ensuring more accurate extraction of spatial semantics.
[0042] This scheme enhances the modeling capability of spatial relationships and significantly improves the distinguishability of neighboring objects. Existing techniques typically rely on global feature matching or uniformly varying spatial scores to represent the scores at different locations, so the spatial scores of neighboring objects, especially those of the same type, differ only slightly and the objects are difficult to distinguish. In contrast, this scheme designs a spatial transformation matrix that adjusts how the spatial score changes between adjacent objects, thereby strengthening the distinguishability between them.
[0043] The scheme first constructs an initial spatial matrix based on the spatial triples parsed from the referential expression, and then introduces a variable...
Step S200: Obtain the input image, encode the input image using a visual encoder to obtain the main visual features, calculate the dynamic weights of each auxiliary encoder based on the main visual features, select the auxiliary encoder based on the dynamic weights of the auxiliary encoders, calculate the auxiliary features based on the auxiliary encoder, and calculate the fusion features based on the main visual features and the auxiliary features. In the specific implementation process, the visual encoder adopts DarkNet-53. For each image, this scheme first uses DarkNet-53 to extract hierarchical visual features and obtain the main visual features, where the dimension term denotes the feature dimension of each grid point and the entire image feature map is divided into... grid points. Three anchor boxes of different scales are predefined at each grid point location. Each first sub-feature of the main visual features corresponds to one grid point.
[0044] In the specific implementation process, to enrich the visual representation, this scheme adopts a dynamic visual routing strategy, using multiple visual encoders to obtain visual features with different emphases. This scheme selects K visual encoders to obtain auxiliary features. The fusion is modeled as a sparse routing process: the router generates dynamic weights from the visual features it receives as input, selects the auxiliary feature with the largest dynamic weight, and fuses it with the main visual feature, which keeps inference sparse and efficient.
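The routing and fusion described here might look roughly like the following. It is a sketch under stated assumptions: the main visual features are averaged over all grid points, each auxiliary encoder has one weight vector and one bias, and the scores are normalized with a softmax. The softmax, the function names, and the scalar fusion weights are assumptions, since the formulas themselves are not reproduced in the text.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the peak for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(main_features, weights, biases):
    """Pick the auxiliary encoder with the largest dynamic weight.

    main_features: list of sub-feature vectors, one per grid point.
    weights, biases: one weight vector and one bias per auxiliary encoder.
    """
    count = len(main_features)
    dim = len(main_features[0])
    # Average the first sub-features over all grid points.
    pooled = [sum(f[d] for f in main_features) / count for d in range(dim)]
    logits = [
        sum(wd * pd for wd, pd in zip(w, pooled)) + b
        for w, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


def fuse(main_features, aux_features, w_main, w_aux):
    """Weighted sum of main and selected auxiliary features per grid point."""
    return [
        [w_main * m + w_aux * a for m, a in zip(mf, af)]
        for mf, af in zip(main_features, aux_features)
    ]
```

Because only the index returned by `route` is used, just one auxiliary encoder needs to run at inference time, which is the sparsity the paragraph refers to.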
[0045] Step S300: Based on the spatial transformation matrix, filter the fusion sub-features in the fusion features to obtain the cropping features.
Step S400: Encode the referential text using a text encoder to obtain text features; then filter the fusion sub-features in the cropping features based on the text features to obtain the target fusion sub-features. In the specific implementation process, the best matching grid point is selected during the inference stage, and the final bounding box coordinates are generated by decoding.
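Selecting the best matching grid point in step S400 can be illustrated as below. Cosine similarity is used here only as an assumed stand-in, because the patent's similarity formula is not reproduced in the text; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def best_matching_sub_feature(text_feature, sub_features):
    """Return the index of the sub-feature most similar to the text feature."""
    sims = [cosine_similarity(text_feature, f) for f in sub_features]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims
```

The returned index identifies the target fusion sub-feature that would then be passed to the prediction head in step S500.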
[0046] Step S500: Input the target fusion sub-features into a preset prediction head, the prediction head outputs a bounding box, and the input image is marked based on the bounding box.
[0047] This approach addresses the challenges of existing weakly supervised referential expression understanding (WREC) models in distinguishing spatial differences between neighboring objects and accurately aligning text and images based on spatial relationships. Instead of relying on implicit spatial relationship semantic features, this approach constructs an explicit spatial transformation matrix by parsing spatial relationships (such as "top left" and "right side") within the referential expression. This matrix reshapes the initial spatial scores using a nonlinear transformation function (such as the sigmoid function). By adjusting the slope and center value parameters, it increases the score difference between regions that conform to the spatial relationship and their opposite regions, significantly enhancing the spatial discriminability of neighboring objects. Finally, it combines text features and selected target fusion sub-features to determine the bounding box location, completing the labeling process. This approach enables accurate labeling of target locations based on the spatial relationships within the referential expression.
[0048] In some embodiments of the present invention, the spatial relationship is used to express a directional positional relationship. In the step of constructing a spatial transformation matrix based on the spatial relationship, the size of the spatial transformation matrix is determined based on the image size of the input image, the initial position of the spatial transformation matrix is determined based on the spatial relationship, the matrix value of the matrix point at the initial position of the spatial transformation matrix is assigned to 0, and the Euclidean distance between the remaining matrix points and the initial position is calculated. The matrix value of the remaining matrix points is determined based on the calculated Euclidean distance to obtain the spatial transformation matrix.
[0049] This solution introduces a large language model as a semantic parser to automatically parse the input referential expression into structured spatial triples. The three elements of a triple are the target subject, i.e., the object ultimately referred to by the referential expression; the spatial relation; and the reference object. When the input expression lacks a necessary reference object or the relation cannot be uniquely determined, the model output is recorded as NULL to explicitly distinguish between "cannot be determined" and "determined as a certain relation." To achieve accurate extraction of spatial relations, this solution designs a structured prompt to guide the model through the spatial triple parsing task. The prompt design follows the logic of "clear target → rule explanation → example guidance → output constraints" to avoid ambiguous model output. The specific prompt design is shown in Table 1.
[0050] Table 1
This scheme focuses on spatial relationships with a clear spatial orientation in two-dimensional space and represents them uniformly as a set. Specifically, the set contains nine typical spatial relationships: left, right, top, bottom, top-left, top-right, bottom-left, bottom-right, and middle.
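One way to hold and validate a parsed triple is sketched below. The input format (three raw strings from the parser), the class and function names, and the choice to map any relation outside the nine listed ones to NULL are assumptions made for illustration; the actual prompt and output format of Table 1 are not reproduced in the text.

```python
from dataclasses import dataclass
from typing import Optional

# The nine two-dimensional spatial relations listed in the text.
SPATIAL_RELATIONS = frozenset({
    "left", "right", "top", "bottom",
    "top-left", "top-right", "bottom-left", "bottom-right", "middle",
})


@dataclass(frozen=True)
class SpatialTriple:
    subject: str               # object ultimately referred to
    relation: Optional[str]    # one of SPATIAL_RELATIONS, or None for NULL
    reference: Optional[str]   # reference object, or None for NULL


def normalize_triple(subject, relation, reference):
    """Validate raw parser output (hypothetical format: three strings)."""
    def clean(value):
        if value is None:
            return None
        text = value.strip()
        return None if text == "" or text.upper() == "NULL" else text

    rel = clean(relation)
    if rel is not None:
        rel = rel.lower().replace(" ", "-")
        if rel not in SPATIAL_RELATIONS:
            rel = None  # assumed: unsupported relations are treated as NULL
    return SpatialTriple(subject.strip(), rel, clean(reference))
```

Keeping NULL as `None` preserves the distinction the text draws between a relation that cannot be determined and one that has been determined.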
[0051] Using the spatial triples obtained from text parsing as the core input, this scheme constructs a spatial matrix in two steps: initial matrix generation and nonlinear optimization. The generation function produces a matrix whose values lie within a fixed range. Specifically, based on the currently extracted spatial relation word, the corresponding reference position is selected by rule: "left" maps to the left boundary of the image, "right" to the right boundary of the image, "top left" to the top left corner of the image, "top right" to the top right corner of the image, and "middle" to the middle point of the image. The spatial score of each grid point in the image is 1 minus the Euclidean distance between the grid point and the reference position indicated by r. Assume the image size is (h, w), with height h and width w; the coordinates of the top left corner are (0, 0), of the top right corner (0, w), of the bottom left corner (h, 0), of the bottom right corner (h, w), and of the middle point (h / 2, w / 2). Assume the coordinates of a certain pixel in the image are... If the spatial relation r is "top left", the spatial score of this pixel is 1 minus its Euclidean distance to the reference position indicated by r. Matrix J denotes a matrix in which all elements are 1.
[0052] In some embodiments of the present invention, in the step of determining the matrix values of the remaining matrix points based on the calculated Euclidean distance, the matrix value is calculated as 1 - the calculated Euclidean distance.
[0053] In some embodiments of the present invention, when the spatial relationship is "top left", the matrix value of each remaining matrix point is calculated using a formula whose terms denote the matrix value of the matrix point at the given coordinates, the height of the input image, and the width of the input image.
[0054] This represents the calculation of Euclidean distance, where the image pixels of the input image correspond one-to-one with the matrix points in the spatial matrix.
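The initial matrix construction for corner and center relations can be sketched as follows. Two points are assumptions rather than statements from the patent: the distance is divided by the image diagonal so that scores stay within [0, 1], and the reference point itself receives the score given by the same 1 minus distance rule. Boundary relations such as "left" and "right" are omitted from this sketch, and the function names are hypothetical.

```python
import math


def reference_position(relation, h, w):
    """Reference point (row, column) for corner and center relations."""
    points = {
        "top-left": (0.0, 0.0),
        "top-right": (0.0, float(w)),
        "bottom-left": (float(h), 0.0),
        "bottom-right": (float(h), float(w)),
        "middle": (h / 2.0, w / 2.0),
    }
    return points[relation]


def initial_spatial_matrix(relation, h, w):
    """Score each grid point as 1 minus its normalized distance to the reference."""
    ref_i, ref_j = reference_position(relation, h, w)
    diagonal = math.hypot(h, w)  # assumed normalizer, keeps scores in [0, 1]
    return [
        [1.0 - math.hypot(i - ref_i, j - ref_j) / diagonal for j in range(w)]
        for i in range(h)
    ]
```

For "top-left" on a small grid, the score is highest at the top left corner and decreases smoothly toward the opposite corner, which is the gradual decay that the later optimization step is meant to sharpen.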
[0055] In some embodiments of the present invention, the step of constructing a spatial transformation matrix based on the spatial relationship further includes optimizing the spatial transformation matrix using an optimization function. The optimization function maps the value of the matrix point at each coordinate of the spatial transformation matrix before optimization to its optimized value, where α and β are preset calculation parameters.
[0056] The initial spatial matrix has several issues. Its spatial score decays relatively gradually, so locations that are clearly inconsistent with the given spatial relationship may still receive high scores. Furthermore, targets that are close to each other often have similar spatial scores. These issues reduce the spatial discrimination ability to some extent. To further enhance the discrimination of these objects based on spatial relationships, this scheme uses a sigmoid function to optimize the spatial matrix.
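The effect of the sigmoid optimization can be illustrated with a short sketch. It assumes the standard logistic form with a steepness parameter `alpha` and a center parameter `beta`, and uses the values 20 and 0.5 that the description reports selecting; the function name `sharpen` is hypothetical.

```python
import math


def sharpen(matrix, alpha=20.0, beta=0.5):
    """Apply a logistic curve to every spatial score.

    ASSUMPTION: the optimization is 1 / (1 + exp(-alpha * (score - beta))).
    """
    return [
        [1.0 / (1.0 + math.exp(-alpha * (value - beta))) for value in row]
        for row in matrix
    ]


# One row of initial scores that decay gradually from 0.9 to 0.1.
scores = [[0.9, 0.6, 0.5, 0.4, 0.1]]
sharp = sharpen(scores)
```

In this example the scores 0.6 and 0.4, which differ by only 0.2 before the transformation, are pushed to opposite sides of 0.5 and end up far apart, while scores near 0.9 and 0.1 saturate toward 1 and 0.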
[0057] In this model, parameter α controls the contrast of the spatial scores by adjusting the steepness of the curve, and parameter β determines the center position of the translation transformation, thus deciding which spatial locations are emphasized or suppressed. The resulting matrix provides an effective spatial prior for subsequent inference. Finally, this scheme filters out grid points that are inconsistent with the spatial relationship indicated by the referential expression. Ablation experiments on α and β are reported in Table 2; α = 20 and β = 0.5 were ultimately selected.
In some embodiments of the present invention, in the step of calculating the dynamic weights of each auxiliary encoder based on the main visual features and selecting the auxiliary encoder based on the dynamic weights of the auxiliary encoders, the auxiliary encoder with the largest dynamic weight is selected. The dynamic weight of each auxiliary encoder is calculated by a formula in which r denotes the dynamic weight of the auxiliary encoder, and h and w denote the height and width of the input image, the size of the input image corresponding to the size of the main visual features. The remaining terms of the formula denote the computational weight parameter and the bias parameter of the auxiliary encoder currently being computed, the first sub-feature of the main visual features at a given location, and the total number of first sub-features of the main visual features.
[0058] In practical implementation, this scheme also sets a calculation weight parameter and a bias parameter for the visual encoder, and calculates the dynamic weight parameter of the main visual features in the same way as the dynamic weights of the auxiliary encoders.
[0059] In some embodiments of the present invention, in the step of calculating the fusion feature based on the main visual feature and the auxiliary feature, the fusion feature is calculated using the following formula:
$$F_{fuse} = r_{a} F_{a} + r_{m} F_{m}$$
where $F_{fuse}$ denotes the fusion feature, $F_{a}$ denotes the auxiliary feature, $F_{m}$ denotes the main visual feature, and $r_{a}$ and $r_{m}$ denote the dynamic weight parameters of the auxiliary feature and the main visual feature, respectively.
[0060] The auxiliary feature includes multiple second sub-features, and each second sub-feature corresponds one-to-one with the first sub-feature.
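The encoder selection and fusion steps can be sketched as follows. Several points are assumptions made for illustration and are not confirmed by the text: the dynamic weight is taken to be a linear map of the mean-pooled main visual features plus a bias, the fusion is taken to be a weighted sum of corresponding sub-features, and all function names and numeric parameters are hypothetical.

```python
def mean_pool(features):
    """Average a flat list of equal-length sub-feature vectors."""
    count = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / count for d in range(dim)]


def dynamic_weight(features, weight, bias):
    """ASSUMPTION: weight = <W, mean-pooled features> + b."""
    pooled = mean_pool(features)
    return sum(wd * pd for wd, pd in zip(weight, pooled)) + bias


def select_auxiliary(features, params):
    """Return (index, weight) of the auxiliary encoder with the largest weight."""
    weights = [dynamic_weight(features, w, b) for w, b in params]
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights[best]


def fuse(main, aux, r_main, r_aux):
    """ASSUMPTION: weighted sum of one-to-one corresponding sub-features."""
    return [
        [r_main * m + r_aux * a for m, a in zip(f_main, f_aux)]
        for f_main, f_aux in zip(main, aux)
    ]


# A 2 x 2 grid flattened to four 2-dimensional first sub-features.
main = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical (W, b) pairs, one per auxiliary encoder.
aux_params = [([0.2, 0.1], 0.0), ([0.5, 0.5], 0.1)]
best, r_aux = select_auxiliary(main, aux_params)
# The main encoder's weight is computed the same way with its own parameters.
r_main = dynamic_weight(main, [0.6, 0.8], 0.0)
# Stand-in output of the selected auxiliary encoder (second sub-features).
aux = [[0.5, 0.5] for _ in main]
fused = fuse(main, aux, r_main, r_aux)
```

Only the auxiliary encoder with the largest weight is run in this sketch, which matches the selection rule in the description and avoids computing features for the encoders that are not chosen.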
[0061] In some embodiments of the present invention, the matrix points of the spatial transformation matrix correspond one-to-one with the fusion sub-features of the fusion feature. In the step of filtering the fusion sub-features in the fusion feature based on the spatial transformation matrix to obtain the cropping features, the preset proportion of fusion sub-features with the largest matrix values at the matrix points of the spatial transformation matrix is retained, and the retained fusion sub-features are combined as the cropping features.
[0062] The preset ratio can be 20%.
[0063] In some embodiments of the present invention, the step of filtering the fusion sub-features in the fusion features based on the spatial transformation matrix to obtain the cropping features further includes: calculating the confidence level of the fusion sub-features based on the detection head; calculating the filtering value based on the matrix value of the fusion sub-features at the matrix points of the spatial transformation matrix and the confidence level of the fusion sub-features; filtering the fusion sub-features based on the filtering value; and combining the retained fusion sub-features as the cropping features.
[0064] In the specific implementation process, in the step of filtering the fusion sub-features based on the filtering value, the 50% of fusion sub-features with the largest filtering values can be retained.
[0065] This scheme employs a two-stage screening strategy to gradually narrow down the range of candidate grid points, thereby obtaining the final cropping features.
[0066] The visual encoder initially yields h×w grid points.
[0067] The first stage filters grid points solely based on their confidence scores from the visual encoder output. Starting from the initial candidate set of all grid points, the top 20% of grid points by confidence score are selected to form an intermediate candidate set G1. This eliminates regions that clearly do not contain the target and reduces subsequent computational complexity.
[0068] The second stage further introduces spatial constraints for finer selection. The spatial transformation matrix M is also of size h×w, so each grid point has a corresponding spatial score. The grid points in set G1 are re-ranked according to the selection value = 0.3 × matrix value + confidence score, and the top 50% of the grid points in G1 are selected to form the final candidate set G2.
[0069] This process ensures that the selected area not only has high visual credibility, but also strictly meets the spatial semantic constraints implicit in the text description, thereby significantly improving the accuracy of target localization driven by spatial relationships.
[0070] In the specific implementation process, this scheme combines the object confidence score and the spatial transformation matrix score to perform a two-stage screening of the fusion sub-features corresponding to the visual grid points.
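The two-stage screening can be sketched as follows, using the proportions (20% and 50%) and the selection value (0.3 × matrix value + confidence score) given in the description. The function name, the flat-list data layout, and the example numbers are hypothetical.

```python
def two_stage_filter(confidence, spatial, keep1=0.2, keep2=0.5, spatial_weight=0.3):
    """Return (G1, G2) as lists of grid-point indices.

    confidence, spatial: flat lists with one entry per grid point.
    Stage 1 keeps the top `keep1` fraction by confidence alone.
    Stage 2 re-ranks G1 by spatial_weight * spatial + confidence and keeps
    the top `keep2` fraction of G1.
    """
    n = len(confidence)
    k1 = max(1, int(n * keep1))
    g1 = sorted(range(n), key=lambda idx: confidence[idx], reverse=True)[:k1]

    k2 = max(1, int(len(g1) * keep2))
    g2 = sorted(
        g1,
        key=lambda idx: spatial_weight * spatial[idx] + confidence[idx],
        reverse=True,
    )[:k2]
    return g1, g2


# Ten grid points: point 0 is slightly more confident, but point 1 agrees
# much better with the spatial relationship.
confidence = [0.90, 0.85, 0.10, 0.20, 0.30, 0.05, 0.40, 0.15, 0.25, 0.35]
spatial = [0.10, 0.90, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50]
g1, g2 = two_stage_filter(confidence, spatial)
```

In this example the first stage keeps points 0 and 1, and the second stage prefers point 1 because its selection value (0.3 × 0.9 + 0.85) exceeds that of point 0 (0.3 × 0.1 + 0.9), even though point 0 has the higher confidence.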
[0071] In some embodiments of the present invention, in the step of filtering the fusion sub-features in the cropping features based on the text features to obtain the target fusion sub-features, the similarity of each fusion sub-feature is calculated by a formula in which one term denotes the similarity of the fusion sub-feature and another denotes the feature value in any one dimension.
[0072] The fusion sub-feature with the highest similarity is used as the target fusion sub-feature.
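The selection of the target fusion sub-feature can be sketched as an argmax over the candidates. The similarity measure used here, a dot product summed over feature dimensions, is an assumption for illustration, and the function name and the dictionary layout are hypothetical.

```python
def select_target(candidates, text):
    """Return the grid-point index whose sub-feature is most similar to the text.

    candidates: dict mapping grid-point index -> fusion sub-feature vector.
    ASSUMPTION: similarity is the dot product over feature dimensions.
    """
    def similarity(feature):
        return sum(f * t for f, t in zip(feature, text))

    return max(candidates, key=lambda idx: similarity(candidates[idx]))


# Three surviving candidates after cropping, with 2-dimensional features.
cropped = {3: [0.2, 0.9], 7: [0.8, 0.1], 12: [0.5, 0.5]}
target = select_target(cropped, text=[1.0, 0.0])
```

The selected index identifies the sub-feature that would then be passed to the prediction head to produce the bounding box.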
[0073] In some embodiments of the present invention, the method further includes model pre-training. In the model pre-training step, corresponding bounding boxes are obtained based on each piece of training data in a preset training data set. InfoNCE contrast loss, cross-modal contrast alignment loss, and spatial contrast loss are calculated based on the entire training data set. The sum of InfoNCE contrast loss, cross-modal contrast alignment loss, and spatial contrast loss is calculated as the total loss. Model training is performed based on the total loss.
[0074] In the specific implementation process, during the calculation of the InfoNCE contrast loss, the visual features corresponding to the sample are used as positive samples, and other visual features present in the batch are used as negative samples to construct the InfoNCE contrast loss:
$$\mathcal{L}_{nce} = -\frac{1}{N} \sum_{n=1}^{N} \log \frac{\exp\left(s(f_n, t_n)\right)}{\sum_{m=1}^{N} \exp\left(s(f_m, t_n)\right)}$$
where $N$ is the total number of training data, $s(\cdot,\cdot)$ denotes the calculation of dot product similarity, $f_n$ and $t_n$ denote the target fusion sub-feature and the text feature of training data n, respectively, $f_m$ denotes the target fusion sub-feature of training data m, and $\mathcal{L}_{nce}$ denotes the InfoNCE contrast loss. This loss encourages matching image-text pairs to be close to each other in the embedding space, while non-matching samples are kept far apart, thereby building global semantic consistency.
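A batch-level InfoNCE loss of this kind can be sketched as follows. The sketch assumes the standard form with dot product similarity, no temperature term, and averaging over the batch; none of these details is confirmed by the text, and the function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(visual, text):
    """Mean InfoNCE loss over a batch.

    visual[n] and text[n] form the matched (positive) pair for sample n;
    visual[m] with m != n acts as a negative for text[n].
    """
    n = len(visual)
    total = 0.0
    for i in range(n):
        logits = [dot(visual[m], text[i]) for m in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / n


visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(visual, text=[[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(visual, text=[[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when each text feature is closest to its own visual feature (`aligned`) than when the pairs are mismatched (`swapped`), which is the behavior the paragraph above describes.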
[0075] Since this model integrates features from multiple visual encoders, this scheme employs a cross-modal contrastive alignment loss to ensure consistency between the visual representations before and after fusion. In its formula, one term denotes the cross-modal contrastive alignment loss and another denotes the main visual features of training data n. This loss ensures that the fused features fully retain the key information of the dominant modality, thereby mitigating the inconsistency caused by the differences in representations between different modalities.
[0076] To further enhance the model's spatial discriminative ability, this scheme introduces a spatial contrast loss, specifically designed to penalize candidate regions that are visually or semantically plausible but spatially incorrect. Specifically, for the nth sample, this scheme selects from the candidate set G2 the grid point with the highest matching score to the referential expression, and denotes its projected features as... Subsequently, the samples screened out on the basis of confidence scores and spatial scores are designated as "negative samples," and their features are denoted as... In the definition of the spatial contrast loss, the terms denote, respectively, the spatial contrast loss, the target fusion sub-feature of training data n, the set of fusion sub-features in the fusion feature that were not retained in the final cropping features, and one fusion sub-feature in that set.
[0077] This loss explicitly strengthens the spatial consistency constraint, making the model more inclined to select regions that are both semantically relevant and spatially correct, effectively mitigating the mismatch problem of "semantically correct but spatially incorrect".
[0078] In practice, the total loss of this scheme is a multi-task loss function that includes the spatial contrast loss, and it is used to guide model training.
[0079] This approach introduces spatial contrast loss during the training phase. Unlike traditional batch-to-batch image-text comparison, this loss function, within a single sample, treats visual grid points that conform to spatial relationships as positive samples and grid points with inconsistent spatial positions as negative samples. This mechanism forces the model to learn spatial semantic consistency, effectively suppressing interfering objects that are "semantically similar but misplaced."
[0080] This approach introduces a fine-grained spatial contrast loss to enhance the spatial semantic alignment of images and text. Existing techniques only employ a batch-based global image-text contrast loss. This alignment method primarily optimizes between samples, making it difficult to distinguish between spatially consistent and spatially inconsistent visual grid points within a single sample. In contrast, this approach proposes a spatial contrast loss. During training, this loss function treats visual grid points that conform to spatial relationships as positive samples and spatially inconsistent grid points as negative samples, performing contrastive learning within each individual sample.
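The within-sample contrast can be sketched as follows. The exact form of the loss is not recoverable from the text, so this sketch assumes an InfoNCE-style objective in which the text feature is the anchor, the target fusion sub-feature is the positive, and the fusion sub-features excluded by the spatial filtering are the negatives. The function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def spatial_contrast_loss(text, positive, negatives):
    """Contrast one spatially consistent grid point against rejected ones.

    ASSUMPTION: loss = -log(exp(s(pos, text)) / sum over {pos} and negatives).
    """
    pos_logit = dot(positive, text)
    logits = [pos_logit] + [dot(neg, text) for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - pos_logit


text = [1.0, 0.0]
positive = [1.0, 0.0]
# A rejected grid point that looks nothing like the description.
easy = spatial_contrast_loss(text, positive, negatives=[[0.0, 1.0]])
# A rejected grid point that is semantically similar but in the wrong place.
hard = spatial_contrast_loss(text, positive, negatives=[[0.9, 0.1]])
```

Under this assumed form, a negative that is semantically close to the text (`hard`) yields a larger loss than a dissimilar one (`easy`), so training pressure concentrates on the "semantically similar but misplaced" distractors that the description targets.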
[0081] During the inference phase, this scheme first selects, from the spatially filtered candidate set G2, the fusion sub-feature that best matches the text features, according to the similarity of the fusion sub-features. The experimental results of this scheme are shown in Figure 5. In Figure 5, Image represents the original image, SREN represents the result of this scheme, and RefCLIP and DViN represent the prediction results of previous models.
[0082] The beneficial effects of this scheme include: 1. A method is provided that can accurately model spatial relationships and enhance the distinguishability of neighboring objects.
[0083] This invention designs a spatial transformation matrix that applies a nonlinear transformation with adjustable parameters to the spatial scores. This enhances the contrast of the spatial scores and amplifies the spatial differences between neighboring objects, thereby accurately modeling spatial relationships and solving the problem of insufficient spatial score discrimination in existing models.
[0084] 2. Provide a training objective that can enhance spatial semantic image-text alignment.
[0085] This invention aims to propose a spatial contrast loss function that treats grid points that conform to spatial relationships as positive samples and those that are spatially inconsistent as negative samples. The goal is to encourage the model to distinguish between spatially consistent and inconsistent grid points, thereby enhancing the spatial semantic alignment between images and text.
[0086] 3. Achieve high-precision, weakly supervised understanding of referential expressions without requiring manual bounding box annotation.
[0087] This invention enhances spatial relationships while maintaining weak supervision, achieving precise positioning of target objects based on referential expressions. In particular, for referential expressions containing explicit spatial relationships (such as "top left" or "right side"), it significantly improves positioning accuracy and reduces annotation costs in practical applications.
[0088] This invention also provides a weakly supervised referential expression understanding device based on spatial relationship enhancement. The device includes a computer device, which includes a processor and a memory. The memory stores computer instructions, and the processor executes the computer instructions stored in the memory. When the computer instructions are executed by the processor, the device implements the steps of the method described above.
[0089] This invention also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the aforementioned weakly supervised referential expression understanding method based on spatial relationship enhancement. The computer-readable storage medium can be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, register, floppy disk, hard disk, removable storage disk, CD-ROM, or any other form of storage medium known in the art.
[0090] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.
[0091] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.
[0092] In this invention, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.
[0093] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A weakly supervised indexing expression understanding method based on spatial relationship enhancement, characterized in that, The steps of the method include: Obtain the referential text, extract the triples of the referential text using a large language model, the triples include spatial relations, and construct a spatial transformation matrix based on the spatial relations; The input image is acquired, and the input image is encoded using a visual encoder to obtain the main visual features. The dynamic weights of each auxiliary encoder are calculated based on the main visual features. An auxiliary encoder is selected based on the dynamic weights of the auxiliary encoders. Auxiliary features are calculated based on the auxiliary encoders. The fusion features are calculated based on the main visual features and the auxiliary features. Based on the spatial transformation matrix, the fusion sub-features in the fusion features are filtered to obtain the cropping features; The text is encoded using a text encoder to obtain text features. Based on the text features, the fusion sub-features in the cropping features are filtered to obtain the target fusion sub-features. The target fusion sub-features are input into a preset prediction head, which outputs a bounding box. The input image is then labeled based on the bounding box.
2. The weakly supervised indexing expression method based on spatial relationship enhancement according to claim 1, characterized in that, The spatial relationship is used to express directional positional relationships. In the step of constructing a spatial transformation matrix based on the spatial relationship, the size of the spatial transformation matrix is determined based on the image size of the input image, the initial position of the spatial transformation matrix is determined based on the spatial relationship, the matrix value of the matrix point at the initial position of the spatial transformation matrix is assigned to 0, and the Euclidean distance between the remaining matrix points and the initial position is calculated. The matrix value of the remaining matrix points is determined based on the calculated Euclidean distance to obtain the spatial transformation matrix.
3. The weakly supervised indexing expression understanding method based on spatial relationship enhancement according to claim 2, characterized in that, In the step of determining the matrix values of the remaining matrix points based on the calculated Euclidean distance, the matrix value is calculated as 1 - the calculated Euclidean distance.
4. The weakly supervised indexing expression method based on spatial relationship enhancement according to claim 1 or 2, characterized in that, The step of constructing the spatial transformation matrix based on the spatial relationship further includes optimizing the spatial transformation matrix using an optimization function, wherein the optimization function is calculated using the following formula:
$$M'_{(i,j)} = \frac{1}{1 + e^{-\alpha \left( M_{(i,j)} - \beta \right)}}$$
where $M'_{(i,j)}$ denotes the optimized value of the matrix point at coordinates (i, j) in the spatial transformation matrix, $M_{(i,j)}$ denotes the value of the matrix point with coordinates (i, j) in the spatial transformation matrix before optimization, and $\alpha$ and $\beta$ are preset calculation parameters.
5. The weakly supervised indexing expression method based on spatial relationship enhancement according to claim 1, characterized in that, In the step of calculating the dynamic weights of each auxiliary encoder based on the main visual features, and selecting the auxiliary encoder based on the dynamic weights of the auxiliary encoders, the auxiliary encoder with the largest dynamic weight is selected, and the dynamic weight of each auxiliary encoder is calculated by a formula in which r denotes the dynamic weight of the auxiliary encoder, and h and w denote the height and width of the input image, the size of the input image corresponding to the size of the main visual features; the remaining terms of the formula denote the computational weight parameter and the bias parameter of the auxiliary encoder currently being computed, the first sub-feature of the main visual features at a given location, and the total number of first sub-features of the main visual features.
6. The weakly supervised indexing expression method based on spatial relationship enhancement according to claim 1, characterized in that, In the step of calculating the fusion features based on the main visual features and auxiliary features, the fusion features are calculated using the following formula:
$$F_{fuse} = r_{a} F_{a} + r_{m} F_{m}$$
where $F_{fuse}$ denotes the fusion features, $F_{a}$ denotes the auxiliary features, $F_{m}$ denotes the main visual features, and $r_{a}$ and $r_{m}$ denote the dynamic weight parameters of the auxiliary features and the main visual features, respectively.
7. The weakly supervised indexing expression method based on spatial relationship enhancement according to claim 1, characterized in that, The matrix points of the spatial transformation matrix correspond one-to-one with the fusion sub-features of the fusion feature. In the step of filtering the fusion sub-features in the fusion feature based on the spatial transformation matrix to obtain the cropping features, the preset proportion of fusion sub-features with the largest matrix values at the matrix points of the spatial transformation matrix is retained, and the retained fusion sub-features are combined as the cropping features.
8. The weakly supervised indexing expression understanding method based on spatial relationship enhancement according to claim 1 or 7, characterized in that, The step of filtering the fusion sub-features in the fusion features based on the spatial transformation matrix to obtain the cropping features further includes: calculating the confidence level of the fusion sub-features based on the detection head; calculating the filtering value based on the matrix value of the fusion sub-features at the matrix points of the spatial transformation matrix and the confidence level of the fusion sub-features; filtering the fusion sub-features based on the filtering value; and combining the filtered fusion sub-features as cropping features.
9. The weakly supervised indexing expression method based on spatial relationship enhancement according to claim 1, characterized in that, The method further includes model pre-training. In the model pre-training step, corresponding bounding boxes are obtained for each piece of training data in the preset training data set. InfoNCE contrast loss, cross-modal contrast alignment loss, and spatial contrast loss are calculated based on the entire training data set. The sum of InfoNCE contrast loss, cross-modal contrast alignment loss, and spatial contrast loss is calculated as the total loss. Model training is performed based on the total loss.
10. A weakly supervised instruction expression device based on spatial relationship enhancement, characterized in that, The device includes a computer device, which includes a processor and a memory, wherein computer instructions are stored in the memory, and the processor is configured to execute the computer instructions stored in the memory. When the computer instructions are executed by the processor, the device implements the steps of the method as described in any one of claims 1 to 9.