Inference-time adaptive open-vocabulary semantic segmentation method, system, and device

By adaptively optimizing the alignment of textual and visual features during the inference phase, the problems of textual ambiguity and visual-language misalignment in remote sensing image semantic segmentation are solved, thereby improving the performance of remote sensing image semantic segmentation and the generalization ability of the model.

CN122244442A · Pending · Publication Date: 2026-06-19 · TIANJIN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TIANJIN UNIV
Filing Date
2026-03-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing remote sensing image semantic segmentation methods are insufficient in recognizing newly emerging categories under open vocabulary settings, and the problems of text ambiguity and polysemy have not been fully resolved, resulting in limited segmentation performance of the models in dynamically changing scenarios.

Method used

We adopt an open-vocabulary semantic segmentation method based on inference-time adaptation. We generate diverse text descriptions through a context-aware text prompt generator and combine feature upsampling of the visual-language model with visual guidance strategies to dynamically optimize text features during the inference stage, thereby alleviating text ambiguity and improving visual-language alignment.

Benefits of technology

It significantly improves the performance of semantic segmentation of remote sensing images and the model's generalization ability in complex scenarios, solves the problems of textual ambiguity and visual-language misalignment, and achieves more efficient semantic segmentation results.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses an open-vocabulary semantic segmentation method, system, and device based on inference-time adaptation. Based on given basic task information, a context-aware text prompt generator constructs task-driven text prompts and generates context-aware text descriptions for each candidate category; a text encoder and a visual encoder extract text features from the generated text descriptions and visual features from the input remote sensing image, respectively; a feature upsampling module derives higher-resolution upsampled visual features from the visual features; during the test-time inference phase, a visual-guided inference-time adaptive strategy uses the visual features and the upsampled visual features to optimize the text features and obtain a semantic segmentation mask, completing the open-vocabulary semantic segmentation of the remote sensing image. By dynamically adjusting the text representation during inference, the invention alleviates text ambiguity, strengthens visual-language alignment in uncertain prediction regions, and thereby improves remote sensing image segmentation performance.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and remote sensing image processing, and more specifically to an open-vocabulary semantic segmentation method, system, and device based on inference-time adaptation.

Background Technology

[0002] Semantic segmentation of remote sensing images, a core task in computer vision, plays a crucial role in various applications such as land use analysis, urban planning, and environmental monitoring. However, existing semantic segmentation methods for remote sensing images are mainly studied under closed-set conditions, limiting their recognition capabilities to a predefined set of categories. This results in poor generalization ability of the models for newly emerging or unseen categories. This limitation makes it difficult for existing technologies to meet dynamically changing practical needs. For example, in scenarios involving newly constructed infrastructure or continuously evolving land cover types, models often fail to effectively identify newly emerging land features.

[0003] To overcome the aforementioned technical bottlenecks, recent research has begun to explore open-vocabulary semantic segmentation techniques for remote sensing images. By leveraging the capabilities of visual-language models and cross-modal learning, open-vocabulary semantic segmentation allows models to assign pixel-level labels based on arbitrary text descriptions, thus overcoming the limitations of predefined training categories. In existing technologies, some solutions address the poor performance of natural image models in the remote sensing domain by introducing feature upsampling modules, optimizing low-resolution features while maintaining semantic consistency with image content. Other solutions propose rotation-aggregated similarity calculation modules and progressively generate scale-aware semantic maps by integrating multi-scale features, aiming to address the technical challenges posed by arbitrary target distribution orientations and significant scale variations in remote sensing images. Furthermore, research has proposed efficient frameworks specifically tailored for remote sensing images by considering rotation invariance and combining them with self-supervised learning methods to enhance semantic representations containing rich spatial information.

[0004] While the aforementioned research has made some progress, existing technologies primarily focus on optimizing the visual representation of remote sensing images and neglect the impact of textual information on the performance of open-vocabulary semantic segmentation. The inventors observe that a key textual ambiguity problem exists in open-vocabulary remote sensing image semantic segmentation. This ambiguity mainly stems from synonymy (visually similar features are assigned different labels) and polysemy (a single class name corresponds to distinctly different visual features). Furthermore, some existing domain-adaptive methods require model training on specific annotated benchmarks, which may lead to overfitting and limit the model's scalability.

Summary of the Invention

[0005] To overcome the shortcomings of existing technologies and address the semantic ambiguity problem in open-vocabulary semantic segmentation of remote sensing images, this invention proposes an open-vocabulary semantic segmentation method, system, and device based on inference-time adaptation. By dynamically adjusting the text representation during the inference stage, text ambiguity is alleviated and the visual-language alignment of uncertain prediction regions is enhanced, thereby further improving the segmentation performance on remote sensing images.

[0006] The objective of this invention can be achieved through the following technical solutions.

[0007] An open-vocabulary semantic segmentation method based on inference-time adaptation includes the following steps:
S1: based on the given basic task information, use a context-aware text prompt generator to construct task-driven text prompts and generate, for each candidate category, a set of diverse, context-aware text descriptions;
S2: use the text encoder and the visual encoder of a pre-trained visual-language model to extract text features from the generated context-aware text descriptions and visual features from the input remote sensing image, respectively; the quantities involved are the number of candidate categories, the number of context-aware text descriptions per candidate category, the height and width of the visual features, and the feature dimension of the text and visual features;
S3: use the feature upsampling module of the pre-trained visual-language model to obtain higher-resolution upsampled visual features from the visual features, where H and W denote the height and width of the upsampled visual features, respectively;
S4: during the test-time inference phase, based on the visual features and the upsampled visual features, apply a visual-guided inference-time adaptive strategy to optimize the text features and obtain the semantic segmentation mask, completing the semantic segmentation of the open-vocabulary remote sensing image.

[0008] Furthermore, the context-aware text prompt generator described in step S1 is built based on a large language model. Its core components include a system text prompt module, a dataset overview module, and a visual feature diversity constraint strategy. The system text prompt module is used to define the overall goal of the task and standardize the output format, thereby generating structured, task-driven response information. The dataset overview module is used to provide domain-level and scene-level contextual information and output text descriptions aligned with the visual features of the dataset. The visual feature diversity constraint strategy is used to ensure that the generated text descriptions can cover the multidimensional visual attributes of categories in remote sensing images.

[0009] Furthermore, in step S4, the process of optimizing the text features with the visual-guided inference-time adaptive strategy, based on the visual features and the upsampled visual features, and of obtaining the semantic segmentation mask is as follows:
S41: based on uncertainty estimation, extract the highest-confidence visual features across all candidate categories and compute their category-level mean; dynamically weight this mean with learnable parameters and use it to calibrate the text features, yielding the calibrated text features;
S42: based on the upsampled visual features and the calibrated text features, obtain the optimized probability distribution over the candidate categories, and optimize the learnable parameters during the inference phase with a pixel-level entropy minimization loss function;
S43: using the optimized learnable parameters, compute the calibrated text features and their similarity to the upsampled visual features to generate the semantic segmentation mask, completing the semantic segmentation of the open-vocabulary remote sensing image.

[0010] Further, in step S41, the highest-confidence visual features across all candidate categories are extracted based on uncertainty estimation, their category-level mean is computed and dynamically weighted by learnable parameters, and the text features are calibrated to obtain the calibrated text features. The specific process is as follows.
First, the similarity between the visual features and the text features is computed to obtain the predicted probability of each pixel in the remote sensing image over all candidate categories, according to formula (1). The terms of formula (1) are the predicted probability distribution, the visual features extracted by the visual encoder, the text features generated by the text encoder, the matrix transpose operation, and the normalized exponential (softmax) function.
Subsequently, the entropy of the predicted probability distribution at each pixel location is computed to estimate the pixel-level prediction uncertainty, according to formula (2). The terms of formula (2) are the predicted probability that the pixel at a given coordinate position is assigned to a given candidate category, and the prediction uncertainty at that coordinate position.
After that, for each candidate category, the invention selects the pixel locations that are predicted to belong to that category and have the lowest entropy values, and extracts the corresponding visual features to form a feature set, as defined by formulas (3) and (4). The terms of these formulas are the preliminary mask prediction, the index-of-maximum (argmax) operator, the mean of the uncertainty distribution, the feature vector of the visual feature matrix at a given position, and the set of locations of high-confidence visual features for each category.
Finally, the highest-confidence visual features across all candidate categories are extracted and their mean is computed according to formula (5), whose terms are the category-level mean visual feature and the highest-confidence visual features.
The category-level mean visual feature is dynamically weighted by the learnable parameters and used to calibrate the text features. The final calibrated text features are computed according to formula (6). In formula (6), the product term is defined as the visual-guided text prompt bias; one factor is the matrix built by repeating the category-level mean visual feature so that its size matches that of the text feature matrix; the other factor is the learnable parameter matrix, initialized to zero, which determines how strongly the visual features are injected into each category description; and the product is element-wise.

[0011] Furthermore, the optimized probability distribution over the candidate categories in step S42 is computed according to formula (7), and the learnable parameters are optimized by minimizing the pixel-level entropy minimization loss function of formula (8), which yields the optimized learnable parameters.

[0012] Furthermore, the semantic segmentation mask in step S43 is computed according to formula (9). The terms of formula (9) are the cosine similarity; the normalized exponential (softmax) function, which maps the computed similarity scores to a probability distribution whose values lie between 0 and 1 and sum to 1 over all candidate categories; the index-of-maximum (argmax) operator, which selects the category index with the highest probability among all candidate categories; and the resulting index of the predicted category.

[0013] The objective of this invention can also be achieved through the following technical solutions.

[0014] An open-vocabulary semantic segmentation system based on inference-time adaptation includes:
Context-aware text prompt generator: based on given basic task information, constructs task-driven text prompts and generates, for each candidate category, a set of diverse, context-aware text descriptions;
Text encoder: extracts text features from the generated context-aware text descriptions;
Visual encoder: extracts visual features from the input remote sensing image;
Feature upsampling module: obtains higher-resolution upsampled visual features from the visual features;
Visual-guided inference-time adaptive strategy: during the test-time inference phase, optimizes the text features based on the visual features and the upsampled visual features and obtains the semantic segmentation mask.

[0015] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above-described open-vocabulary semantic segmentation method based on inference-time adaptation.

[0016] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the above-described open-vocabulary semantic segmentation method based on inference-time adaptation.

[0017] Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows. (1) To address the text ambiguity problem in open-vocabulary remote sensing image semantic segmentation, this invention proposes, for the first time, a plug-and-play and efficient open-vocabulary semantic segmentation technique based on inference-time multi-prompt adaptation, which can significantly improve the performance of open-vocabulary remote sensing image semantic segmentation.

[0018] (2) The context-aware text prompt generator proposed in this invention can generate a diverse set of context-aware text prompt semantic descriptions, aiming to alleviate the ambiguity of semantic expression. By constructing task-driven text prompts containing basic task information (such as task scenario and category name), the large language model is guided to generate a diverse set of context-aware descriptive texts for each category. The generated task-driven text prompts can alleviate the inherent text ambiguity in the original category names and solve the problem of inconsistent annotation between different datasets.

(3) Given the potential misalignment between pre-generated task-driven text prompts and image visual features in different tasks, especially when dealing with land cover categories with highly similar visual features, the proposed visual-guided inference-time adaptive strategy ensures superior generalization performance of the model across different application scenarios by dynamically optimizing the feature embedding of text prompts during the inference stage. Since visual-language misaligned image regions often exhibit high entropy (low confidence) during prediction, this invention calculates and filters low-entropy (high-confidence) visual features and constructs a prompt bias using a weighted combination of these features, thereby enhancing the matching accuracy of uncertain regions. During the inference stage, this invention further introduces a pixel-level entropy minimization loss function, establishing a semantic association between uncertain prediction regions and high-confidence visual features by optimizing the weights of each visual feature in the prompt bias. This mechanism effectively improves the alignment quality between visual representation and text semantics, thereby enhancing the model's generalization performance and cross-domain adaptability in complex remote sensing application scenarios.

Attached Figure Description

[0019] Figure 1 is an architecture diagram of the open-vocabulary remote sensing image segmentation method based on inference-time multi-prompt adaptation.

[0020] Figure 2 is a schematic diagram of the context-aware text prompt generator.

[0021] Figure 3 is an example of the context-aware text prompt generator.

Detailed Implementation

[0022] The present invention will now be further described with reference to the accompanying drawings.

[0023] Given an input remote sensing image and a set of candidate concepts defined in natural language, the goal of open-vocabulary remote sensing image semantic segmentation is to predict a semantic segmentation mask that assigns a candidate concept label to each pixel. The quantities involved are the number of candidate categories and the height and width of the visual features. The lexical units in the candidate concept set can be of arbitrary length, and this set plays a crucial role in the performance of text-guided segmentation. However, the inherent textual ambiguity in candidate concept labels severely limits segmentation performance. Therefore, this invention proposes a plug-and-play and efficient open-vocabulary semantic segmentation method based on inference-time adaptation, which can be flexibly integrated into various existing segmentation architectures.

[0024] As shown in Figure 1, the open-vocabulary semantic segmentation method based on inference-time adaptation of the present invention specifically includes the following steps S1 to S4.

[0025] S1: based on the given basic task information, use a context-aware text prompt generator to construct task-driven text prompts and generate, for each candidate category, a set of diverse, context-aware text descriptions.

[0026] In open-vocabulary remote sensing image semantic segmentation, visual-language models (such as CLIP) achieve inference under open-vocabulary settings by projecting test images and candidate category names into a shared feature space. Therefore, the quality of text prompts directly determines the model's understanding of the target concept and significantly affects its generalization ability to new scenes. However, directly using the original category names as text prompts introduces textual ambiguity due to inconsistent annotation standards and lexical differences (such as polysemy and synonymy), thereby impairing the alignment between textual and visual features.

[0027] To address the aforementioned technical problems, alleviate the inherent textual ambiguity in the original class names, and resolve semantic inconsistencies between different datasets, this invention designs a context-aware text prompt generator. This generator prompts large language models (such as Gemini) to generate diverse and context-aware text descriptions for each candidate category. As shown in Figure 2, the context-aware text prompt generator is built upon a large language model and mainly comprises three core components: a system text prompt module, a dataset overview module, and a visual feature diversity constraint strategy. In practical applications, this invention generates multiple different descriptive statements for each candidate category, for example five text descriptions per candidate category, and uses these as text prototypes to guide open-vocabulary segmentation. The system text prompt module defines the overall task objective and standardizes the output format, thereby guiding the large language model to generate structured, task-driven response information. The dataset overview module provides domain-level and scene-level contextual information so that the output text descriptions are aligned with the visual features of the dataset, thus improving the robustness of the segmentation process to textual ambiguity. The visual feature diversity constraint strategy ensures that the generated text descriptions cover the multidimensional visual attributes of candidate categories in remote sensing imagery, thereby enhancing the discriminative power and expressive richness of the text representation.
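The three components described above can be pictured as three pieces of text that are joined into one prompt per candidate category. The following is a minimal sketch only: the function name `build_task_prompt`, the wording of each prompt part, and the attribute list are hypothetical illustrations, not the prompts used in the invention, and no call to a large language model is made.

```python
def build_task_prompt(domain, dataset_overview, category, num_descriptions=5):
    """Assemble one task-driven prompt for a single candidate category."""
    # System text prompt: states the overall goal and fixes the output format.
    system_prompt = (
        f"You describe land-cover categories for {domain} semantic segmentation. "
        f"Return exactly {num_descriptions} numbered one-sentence descriptions."
    )
    # Dataset overview: domain-level and scene-level context.
    overview = f"Dataset overview: {dataset_overview}"
    # Visual feature diversity constraint (attribute list is an assumption).
    attributes = ["color", "texture", "shape", "size", "spatial context"]
    diversity = (
        "Each description must emphasize a different visual attribute, chosen from: "
        + ", ".join(attributes) + "."
    )
    task = f"Category name: {category}"
    return "\n".join([system_prompt, overview, diversity, task])


prompt = build_task_prompt(
    domain="remote sensing",
    dataset_overview="aerial images of urban scenes",  # placeholder text
    category="building",
)
print(prompt)
```

The resulting string would be sent to the language model once per candidate category, and the returned descriptions would serve as the text prototypes for that category.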

[0028] In summary, the context-aware text prompt generator can provide effective text descriptions with rich visual features for target datasets, significantly enhancing the model's understanding of ground feature concepts. Figure 3 uses the "Buildings" category in the WHU Aerial dataset as an example to illustrate the generated text descriptions.

[0029] S2: use the text encoder and the visual encoder of a pre-trained visual-language model (such as CLIP) to extract text features from the generated context-aware text descriptions and visual features from the input remote sensing image, respectively. The quantities involved are the number of candidate categories, the number of context-aware text descriptions per candidate category, the height and width of the visual features, and the feature dimension of the text and visual features.

[0030] S3: in the vision branch, use the feature upsampling module of the pre-trained visual-language model (such as CLIP) to obtain higher-resolution upsampled visual features from the visual features, where H and W denote the height and width of the upsampled visual features, respectively.

[0031] S4: during the test-time inference phase, based on the visual features and the upsampled visual features, apply a visual-guided inference-time adaptive strategy to optimize the text features and obtain the semantic segmentation mask, completing the semantic segmentation of the open-vocabulary remote sensing image.

[0032] The visual-guided adaptive inference strategy described in this invention aims to address the visual-text misalignment problem that persists even with multi-text prompts, despite variations in application scenarios and tasks. Visual analysis of the prediction entropy obtained from the similarity distribution between visual and textual features reveals that correctly predicted image regions consistently exhibit low entropy values, indicating high confidence; while incorrectly predicted image regions correspond to high entropy values, indicating high uncertainty. Given that visual-language misalignment leads to erroneous predictions, this invention utilizes auxiliary visual feature matching to improve the visual-language alignment of high-uncertainty prediction regions. The core of this strategy lies in integrating the matching between the visual features of uncertain prediction regions and the corresponding high-confidence predicted visual features into the visual-language alignment process during the inference phase.

[0033] The specific optimization process of the visual-guided inference-time adaptive strategy described in this invention is as follows. S41: based on uncertainty estimation, extract the highest-confidence visual features across all candidate categories and compute their category-level mean; this mean is dynamically weighted by learnable parameters to construct the text prompt bias, which is combined with the initial text features to calibrate them and obtain the calibrated text features.

[0034] To obtain high-confidence visual features for constructing the text prompt bias, this invention first computes the similarity between the visual features and the text features to obtain the predicted probability of each pixel in the remote sensing image over all candidate categories, according to formula (1). The terms of formula (1) are the predicted probability distribution, the visual features extracted by the visual encoder, the text features generated by the text encoder, the matrix transpose operation, and the normalized exponential (softmax) function.
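As a hedged reading of formula (1), the five terms listed above suggest a softmax over the product of the visual features and the transposed text features. The symbols $P$, $F_v$, $F_t$, $h$, $w$, $K$, and $D$ below are assumptions introduced for illustration, and the text does not state how the several descriptions of one category are reduced to a single row of $F_t$.

```latex
P = \operatorname{softmax}\!\left( F_v F_t^{\top} \right), \qquad
F_v \in \mathbb{R}^{hw \times D}, \quad
F_t \in \mathbb{R}^{K \times D}, \quad
P \in \mathbb{R}^{hw \times K}
```

Here the softmax is taken along the category axis, so each row of $P$ is one pixel's probability distribution over the $K$ candidate categories.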

[0035] Subsequently, this invention computes the entropy of the predicted probability distribution at each pixel location in the remote sensing image to estimate the pixel-level prediction uncertainty, according to formula (2). The terms of formula (2) are the predicted probability that the pixel at a given coordinate position is assigned to a given candidate category, and the prediction uncertainty at that coordinate position.
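The entropy-based uncertainty estimate can be illustrated for a single pixel. This is a minimal sketch assuming the standard Shannon entropy of the softmax probabilities; the function names and the toy similarity scores are illustrative and do not come from the patent.

```python
import math


def softmax(scores):
    """Normalized exponential function over one pixel's similarity scores."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_entropy(probs, eps=1e-12):
    """Shannon entropy of one pixel's category distribution."""
    return -sum(p * math.log(p + eps) for p in probs)


# One pixel with a clear winner and one with equal scores for three categories.
confident = softmax([8.0, 0.0, 0.0])
uncertain = softmax([1.0, 1.0, 1.0])
u_conf = pixel_entropy(confident)
u_unc = pixel_entropy(uncertain)
print(u_conf < u_unc)
```

A pixel whose similarity scores are nearly equal across categories gets an entropy close to the logarithm of the number of categories, while a pixel with one dominant score gets an entropy close to zero.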

[0036] After that, for each candidate category, this invention selects the pixel locations that are predicted to belong to that category and have the lowest entropy values, and extracts the corresponding visual features to form a feature set, as defined by formulas (3) and (4). The terms of these formulas are the preliminary mask prediction, the index-of-maximum (argmax) operator, the mean of the uncertainty distribution, the feature vector of the visual feature matrix at a given position, and the set of locations of high-confidence visual features for each category.
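One possible reading of this selection step is sketched below. How formulas (3) and (4) combine the mean of the uncertainty distribution with the lowest-entropy ranking is not recoverable from the text, so this sketch assumes both are applied: a pixel must be predicted as the category, have entropy no greater than the mean entropy, and rank among the `top_n` lowest. The function name, the dictionary layout, and the toy values are hypothetical.

```python
def select_confident_pixels(probs, entropy, category, top_n):
    """Return up to top_n positions of low-entropy pixels predicted as `category`.

    probs:   {(i, j): [p_1, ..., p_K]} per-pixel category probabilities
    entropy: {(i, j): float} per-pixel prediction uncertainty
    """
    mean_entropy = sum(entropy.values()) / len(entropy)
    candidates = [
        pos for pos, p in probs.items()
        # Preliminary mask: argmax over the category probabilities.
        if max(range(len(p)), key=p.__getitem__) == category
        # Assumed threshold: keep pixels at or below the mean uncertainty.
        and entropy[pos] <= mean_entropy
    ]
    candidates.sort(key=lambda pos: entropy[pos])  # lowest entropy first
    return candidates[:top_n]


# Toy 2 x 2 probability map with illustrative entropy values.
probs = {(0, 0): [0.9, 0.1], (0, 1): [0.6, 0.4],
         (1, 0): [0.2, 0.8], (1, 1): [0.55, 0.45]}
entropy = {(0, 0): 0.33, (0, 1): 0.67, (1, 0): 0.50, (1, 1): 0.69}

set_0 = select_confident_pixels(probs, entropy, category=0, top_n=2)
set_1 = select_confident_pixels(probs, entropy, category=1, top_n=2)
print(set_0, set_1)
```

The visual feature vectors at the returned positions would then form the high-confidence feature set of that category.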

[0037] Finally, this invention extracts the highest-confidence visual features across all candidate categories and computes their mean according to formula (5), whose terms are the category-level mean visual feature and the highest-confidence visual features.

[0038] The category-level mean visual feature is dynamically weighted by the learnable parameters and used to calibrate the text features. The final calibrated text features are computed according to formula (6). In formula (6), the product term is defined as the visual-guided text prompt bias. One factor is the matrix built by repeating the category-level mean visual feature so that its size matches that of the text feature matrix. The other factor is the learnable parameter matrix, initialized to zero, which determines how strongly the visual features are injected into each category description. The product is element-wise.
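The mean and calibration steps can be sketched for one category as follows, assuming the calibrated text features are the original text features plus the element-wise product of the learnable matrix and the repeated mean visual feature. The function names and the toy numbers are illustrative, not taken from the patent.

```python
def mean_feature(features):
    """Category-level mean of the selected high-confidence visual features."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def calibrate(text_features, class_mean, weights):
    """Add the visual-guided prompt bias to each text description of a category.

    text_features, weights: N x D nested lists; class_mean: length-D list.
    The mean is reused for every row, which mirrors repeating it to the
    size of the text feature matrix.
    """
    return [
        [t + w * v for t, w, v in zip(t_row, w_row, class_mean)]
        for t_row, w_row in zip(text_features, weights)
    ]


text = [[1.0, 0.0], [0.0, 1.0]]        # N = 2 descriptions, D = 2
selected = [[0.2, 0.4], [0.4, 0.8]]    # high-confidence visual features
v_mean = mean_feature(selected)

w_zero = [[0.0, 0.0], [0.0, 0.0]]      # zero initialization
unchanged = calibrate(text, v_mean, w_zero)
shifted = calibrate(text, v_mean, [[0.5, 0.5], [0.0, 0.0]])
print(unchanged, shifted)
```

Because the learnable matrix starts at zero, the calibrated text features equal the original ones before any optimization step, so the strategy starts from the baseline prediction and only departs from it as the weights are updated.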

[0039] S42: based on the upsampled visual features and the calibrated text features, obtain the optimized probability distribution over the candidate categories, and optimize the learnable parameters during the inference phase with a pixel-level entropy minimization loss function.

[0040] To optimize, during the inference phase, the text prompt bias that contains the learnable parameters, this invention introduces a pixel-level entropy minimization loss function for the segmentation task. Specifically, given a test image and its context-aware text descriptions, the invention obtains the corresponding upsampled visual features and the text features calibrated by the prompt bias, from which the optimized probability distribution over the categories is computed according to formula (7). The learnable parameters are then optimized by minimizing the pixel-level entropy minimization loss function of formula (8), which yields the optimized learnable parameters.

[0041] By optimizing the learnable parameters, the pixel-level entropy minimization loss function increases the separability of each pixel's category probability distribution and at the same time reduces prediction uncertainty.
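A hedged reading of formulas (7) and (8), consistent with the description above, is given below. The symbols $\hat{P}$, $F_{up}$, $\hat{F}_t$, $\bar{V}$, and $\alpha$ (the learnable matrix) are assumptions introduced for illustration, and the averaging over the $H \times W$ pixels is likewise an assumption, since the text only states that pixel-level entropy is minimized.

```latex
\hat{F}_t(\alpha) = F_t + \alpha \odot \bar{V}, \qquad
\hat{P} = \operatorname{softmax}\!\left( F_{up}\, \hat{F}_t(\alpha)^{\top} \right)

\mathcal{L}_{ent}(\alpha) = -\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{K}
  \hat{P}_k(i,j) \, \log \hat{P}_k(i,j), \qquad
\alpha^{*} = \arg\min_{\alpha} \mathcal{L}_{ent}(\alpha)
```

Only $\alpha$ receives gradients under this reading; the encoders and the upsampling module stay fixed, so the adaptation changes the text side of the alignment and nothing else.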

[0042] S43: using the optimized learnable parameters, compute the calibrated text features and their similarity to the upsampled visual features to generate the semantic segmentation mask, completing the semantic segmentation of the open-vocabulary remote sensing image.

[0043] The semantic segmentation mask is computed according to formula (9). The terms of formula (9) are the cosine similarity; the normalized exponential (softmax) function, which maps the computed similarity scores to a probability distribution whose values lie between 0 and 1 and sum to 1 over all candidate categories; the index-of-maximum (argmax) operator, which selects the category index with the highest probability among all candidate categories; and the resulting index of the predicted category.

[0044] Based on the optimized learnable parameters, this invention generates the final segmentation result. By using the calibrated text features, it can significantly reduce the prediction uncertainty of ambiguous regions, thereby producing more robust and reliable remote sensing image segmentation results.
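The chain described for formula (9), cosine similarity followed by softmax and argmax, can be sketched for a single pixel. This sketch assumes one calibrated text feature per category; how several descriptions of one category are aggregated is not stated in the text. The function names and toy vectors are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Map similarity scores to probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(pixel_feature, class_text_features):
    """Return (predicted category index, category probabilities) for one pixel."""
    sims = [cosine(pixel_feature, t) for t in class_text_features]
    probs = softmax(sims)
    label = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return label, probs


label, probs = predict_category([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
print(label, probs)
```

Applying `predict_category` to every upsampled pixel feature would yield the H x W mask of category indices.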

[0045] Based on the specific steps of the aforementioned open-vocabulary semantic segmentation methods, this invention proposes a plug-and-play and effective inference-time adaptive open-vocabulary semantic segmentation system. This system can be flexibly integrated into various existing segmentation architectures and mainly consists of components such as a context-aware text prompt generator, a text encoder, a visual encoder, and a visually guided inference-time adaptive strategy. This invention aims to address the textual ambiguity problem in remote sensing image segmentation tasks and enhance the visual-linguistic matching degree of uncertain prediction regions by introducing a context-aware text prompt generator and a visually guided inference-time adaptive strategy. In this invention, the text encoder and visual encoder can respectively employ a pre-trained visual-linguistic model (such as CLIP) for the text encoder and a pre-trained visual-linguistic model (such as CLIP) for the visual encoder. Context-aware text prompt generator: Based on given basic task information, it constructs task-driven text prompts for each candidate category. ( Generate a set containing A diverse and context-aware text description, Text encoder: Extracts text features from a generated context-aware text description and an input remote sensing image. .

[0046] Visual encoder: extracts visual features from the input remote sensing image. Feature upsampling module: obtains higher-resolution upsampled visual features from the visual features. Visually guided inference-time adaptive strategy: based on the text features, the visual features, and the upsampled visual features, optimizes the text features and produces the final semantic segmentation mask.
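The data flow between the five components listed above can be sketched as a simple composition. This is only a structural illustration: the class name `OVSegPipeline`, its field names, and the callable signatures are hypothetical, and each component is passed in as an arbitrary callable rather than a concrete model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class OVSegPipeline:
    """Wires the components in the order described in the text."""

    prompt_generator: Callable[[str, Sequence[str]], Any]  # task info, categories -> descriptions
    text_encoder: Callable[[Any], Any]                     # descriptions -> text features
    visual_encoder: Callable[[Any], Any]                   # image -> visual features
    upsampler: Callable[[Any], Any]                        # visual features -> upsampled features
    adapter: Callable[[Any, Any, Any], Any]                # (text, visual, upsampled) -> mask

    def __call__(self, image, task_info, categories):
        descriptions = self.prompt_generator(task_info, categories)
        text_feats = self.text_encoder(descriptions)
        visual_feats = self.visual_encoder(image)
        upsampled_feats = self.upsampler(visual_feats)
        # The adaptive strategy sees both resolutions of visual features.
        return self.adapter(text_feats, visual_feats, upsampled_feats)
```

Keeping the adaptive strategy as the last callable reflects the plug-and-play framing: the encoders and the upsampler can come from an existing segmentation architecture without modification.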

[0047] Based on the specific steps of the above-mentioned open-vocabulary semantic segmentation method, this invention also proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the above-mentioned open-vocabulary semantic segmentation method based on inference-time adaptation.

[0048] Based on the specific steps of the above-mentioned open-vocabulary semantic segmentation method, this invention also proposes a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the above-mentioned open-vocabulary semantic segmentation method based on inference-time adaptation.

Specific Implementation

[0049] In this embodiment, the invention is evaluated on open-vocabulary remote sensing image segmentation datasets, comprising eight multi-class semantic segmentation datasets (OpenEarthMap, LoveDA, iSAID, Potsdam, Vaihingen, UAVid, UDD5, and VDD) and nine single-class extraction datasets (WHU Aerial, WHU Sat.II, Inria, xBD pre, CHN6-CUG, DeepGlobe, Massachusetts, SpaceNet, and WBS-SI).

[0050] Both the visual encoder and the text encoder are initialized from CLIP ViT-B/16, and the input image is resized so that its long side is 448 pixels. Inference uses a sliding window of 224×224 pixels with a stride of 112 pixels. During inference-time adaptation, all original parameters of the visual encoder, the text encoder, and the feature upsampling module of the baseline model are frozen; only the text prompt bias is optimized. The adaptation uses the Adam optimizer with 3 optimization steps.
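The sliding-window setting above (224×224 window, stride 112) can be illustrated by enumerating the window coordinates for an image. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the border rule (shifting the last window back so it ends at the image edge) is a common convention that the text does not specify.

```python
def window_starts(length, window=224, stride=112):
    """Start offsets of the sliding windows along one image axis."""
    count = max(length - window + stride - 1, 0) // stride + 1
    starts = []
    for i in range(count):
        end = min(i * stride + window, length)
        # Assumed border rule: shift the last window back to fit the image.
        starts.append(max(end - window, 0))
    return starts


def sliding_windows(height, width, window=224, stride=112):
    """Return (top, left, bottom, right) boxes covering the whole image."""
    return [
        (y, x, min(y + window, height), min(x + window, width))
        for y in window_starts(height, window, stride)
        for x in window_starts(width, window, stride)
    ]
```

For an axis of 448 pixels this gives three overlapping windows starting at 0, 112, and 224, so every interior pixel is covered by more than one window.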

[0051] For performance evaluation, this invention uses mean intersection-over-union (mIoU) to measure multi-class semantic segmentation performance. For single-class extraction tasks, the IoU of the foreground class is used as the metric. All programs are implemented in PyTorch and run on eight RTX A6000 GPUs.
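The two metrics named above can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, the masks are plain nested lists of category indices, and the scores are computed for a single image, whereas benchmark evaluation typically accumulates intersections and unions over an entire dataset.

```python
def class_iou(pred, target, cls):
    """IoU of one category between two equally sized index masks.

    Returns None when the category appears in neither mask.
    """
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            in_pred, in_target = p == cls, t == cls
            inter += in_pred and in_target
            union += in_pred or in_target
    return inter / union if union else None


def mean_iou(pred, target, num_classes):
    """Mean IoU over the categories present in either mask."""
    ious = [class_iou(pred, target, c) for c in range(num_classes)]
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else None
```

For a single-class extraction task, the foreground metric is `class_iou(pred, target, 1)` when the foreground is encoded as index 1.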

[0052] Although the functions and working processes of the present invention have been described above in conjunction with the accompanying drawings, the present invention is not limited to the specific functions and working processes described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of the present invention without departing from the spirit and scope of the claims, and all of these are within the protection scope of the present invention.

Claims

1. An open-vocabulary semantic segmentation method based on inference-time adaptation, characterized in that it comprises the following steps: S1, based on given basic task information, using a context-aware text prompt generator to construct a task-driven text prompt for each candidate category and to generate a set of diverse, context-aware text descriptions; S2, using the text encoder and the visual encoder of a pre-trained vision-language model to extract text features from the generated context-aware text descriptions and visual features from the input remote sensing image, respectively, wherein the text features are characterized by the number of candidate categories, the number of context-aware text descriptions per candidate category, and the feature dimension, and the visual features are characterized by their height, their width, and the same feature dimension; S3, using the feature upsampling module of the pre-trained vision-language model to obtain, from the visual features, higher-resolution upsampled visual features, where H and W denote the height and width of the upsampled visual features, respectively; S4, during the test-time inference phase, based on the visual features and the upsampled visual features, applying a visually guided inference-time adaptive strategy to optimize the text features and obtain the semantic segmentation mask, thereby completing the semantic segmentation of the open-vocabulary remote sensing image.

2. The open-vocabulary semantic segmentation method based on inference-time adaptation according to claim 1, characterized in that the context-aware text prompt generator described in step S1 is built on a large language model, and its core components include a system text prompt module, a dataset overview module, and a visual feature diversity constraint strategy; the system text prompt module defines the overall goal of the task and standardizes the output format, thereby generating structured, task-driven responses; the dataset overview module provides domain-level and scene-level contextual information and outputs text descriptions aligned with the visual characteristics of the dataset; and the visual feature diversity constraint strategy ensures that the generated text descriptions cover the multidimensional visual attributes of the categories in remote sensing images.

3. The open-vocabulary semantic segmentation method based on inference-time adaptation according to claim 1, characterized in that, in step S4, applying the visually guided inference-time adaptive strategy during the test-time inference phase, based on the visual features and the upsampled visual features, to optimize the text features and obtain the semantic segmentation mask specifically comprises: S41, based on uncertainty estimation, extracting the highest-confidence visual features for all candidate categories, computing their mean, dynamically weighting the mean with learnable parameters, and using the result to calibrate the text features, thereby obtaining calibrated text features; S42, based on the upsampled visual features and the calibrated text features, obtaining the optimized probability distribution over the candidate categories, and, during the inference phase, optimizing the learnable parameters by minimizing a pixel-level entropy loss function; S43, based on the optimized learnable parameters, computing the similarity between the calibrated text features and the upsampled visual features to generate the semantic segmentation mask, thereby completing the semantic segmentation of the open-vocabulary remote sensing image.

4. The open-vocabulary semantic segmentation method based on inference-time adaptation according to claim 3, characterized in that step S41, in which the highest-confidence visual features for all candidate categories are extracted based on uncertainty estimation, averaged, dynamically weighted by learnable parameters, and used to calibrate the text features to obtain the calibrated text features, specifically comprises: first, computing the similarity between the visual features and the text features to obtain, for each pixel of the remote sensing image, the predicted probability over all candidate categories, according to formula (1), in which the terms denote the predicted probability distribution, the visual features extracted by the visual encoder, and the text features generated by the text encoder, the superscript denotes the matrix transpose operation, and softmax denotes the normalized exponential function; subsequently, computing the entropy of the predicted probability distribution at each pixel location of the remote sensing image to estimate the pixel-level prediction uncertainty, according to formula (2), in which the terms denote the predicted probability that the pixel at a given coordinate position is assigned to a given candidate category and the prediction uncertainty at that coordinate position; then, for each candidate category, selecting the top pixel locations with the lowest entropy values among those predicted to belong to that candidate category, and extracting the corresponding visual features to form a feature set, defined by formulas (3) and (4), in which the terms denote the preliminary mask prediction result, the maximum-value index (argmax) operator, the mean of the uncertainty distribution, the feature vector of the visual feature matrix at a given position, and the set of locations of high-confidence visual features for the given category; finally, computing the mean of the highest-confidence visual features of each candidate category according to formula (5), in which the terms denote the category-level average visual features and the highest-confidence visual features of each candidate category; the category-level average visual features are dynamically weighted by the learnable parameters to calibrate the text features, and the final calibrated text features are computed according to formula (6), in which the product term is defined as the visually guided text prompt bias, one factor of the product is the matrix constructed by repeating the category-level average visual features so that its size matches that of the text feature matrix, the learnable parameters form a learnable matrix initialized to zero that determines the strength with which visual features are injected into each category description, and the operator in the product term denotes element-wise multiplication.

5. The open-vocabulary semantic segmentation method based on inference-time adaptation according to claim 3, characterized in that the optimized probability distribution over the candidate categories described in step S42 is computed according to formula (7), and the learnable parameters are optimized by minimizing the pixel-level entropy loss function of formula (8), whose minimizer gives the optimized learnable parameters.

6. The open-vocabulary semantic segmentation method based on inference-time adaptation according to claim 3, characterized in that the semantic segmentation mask described in step S43 is computed according to formula (9), in which the similarity operator denotes cosine similarity; softmax denotes the normalized exponential function, which maps the computed similarity scores to a probability distribution whose values lie between 0 and 1 and sum to 1 over all candidate categories; argmax denotes the maximum-value index operator, which selects the category index with the highest probability among all candidate categories; and the result is the index of the predicted category.

7. An inference-time adaptive open-vocabulary semantic segmentation system based on the open-vocabulary semantic segmentation method based on inference-time adaptation according to any one of claims 1 to 6, characterized in that it comprises: a context-aware text prompt generator, which, based on given basic task information, constructs a task-driven text prompt for each candidate category and generates a set of diverse, context-aware text descriptions; a text encoder, which extracts text features from the generated context-aware text descriptions; a visual encoder, which extracts visual features from the input remote sensing image; a feature upsampling module, which obtains higher-resolution upsampled visual features from the visual features; and a visually guided inference-time adaptive strategy, which, during the test-time inference phase, optimizes the text features based on the visual features and the upsampled visual features and obtains the semantic segmentation mask.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the open-vocabulary semantic segmentation method based on inference-time adaptation as described in any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the open-vocabulary semantic segmentation method based on inference-time adaptation as described in any one of claims 1 to 6.