A calibration method for test-time adaptation of visual language models based on one-time preparation
By initializing the cache with synthetic data and using a lightweight probe network to predict the fusion weights dynamically, the method addresses the calibration error and overconfidence of visual language models during test-time adaptation. It improves model reliability and accuracy while maintaining efficient inference, making it suitable for high-risk domains.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAMEN UNIV
- Filing Date
- 2026-02-24
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This invention relates to the field of image processing with visual language models, and specifically to a calibration method for test-time adaptation of visual language models based on one-time preparation.
Background Technology
[0002] Visual language models, exemplified by CLIP, have demonstrated exceptional zero-shot generalization capabilities through pre-training on billions of image-text pairs. However, existing research indicates that their performance significantly degrades when faced with out-of-distribution data that differs from the pre-training data distribution or when domain shifts occur. To address this gap, Test-Time Adaptation (TTA) has emerged, aiming to dynamically adjust the model during the inference phase using unlabeled test data to adapt to new data distributions.
[0003] Early TTA research focused primarily on the prompt-tuning paradigm. For example, Shu et al. proposed a test-time prompt tuning method that optimizes learnable prompts online by minimizing the prediction entropy of a single test sample across multiple augmented views. While such methods improve performance, their reliance on backpropagation iterations at test time leads to high inference latency, making them unsuitable for real-time applications. To address this efficiency problem, research has gradually shifted towards cache-based, training-free methods. The training-free dynamic adaptation (TDA) method of Karmanov et al. achieves efficient non-parametric adaptation by constructing a dynamic key-value cache, as shown in Figure 1. BoostAdapter by Zhang et al. further expands the cache available for a single test image through data augmentation, as shown in Figure 2. Because these cache-based methods avoid parameter updates, they greatly improve inference speed while still delivering strong accuracy, easing the usual trade-off between efficiency and accuracy.
[0004] In contrast to their strong accuracy and inference speed, existing efficiency-driven TTA methods suffer from serious deficiencies in model calibration. Calibration measures how well predicted probabilities match the true underlying probabilities: when a perfectly calibrated model assigns a confidence of 0.8 to its predictions, those predictions should be correct 80% of the time. Research by Sheng et al. indicates that existing TTA methods may sacrifice model calibration and intrinsic robustness for marginal accuracy improvements. The work of Guo et al. elaborates on the importance of model calibration and proposes temperature scaling as a standard post-processing calibration method. However, traditional calibration methods typically require an independent and identically distributed (i.i.d.) labeled validation set, which is often unavailable in test-time adaptation scenarios.
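To make the notion of calibration concrete, the following minimal sketch computes the expected calibration error (ECE) with equal-width confidence bins. The function name `expected_calibration_error` and the default of 10 bins are illustrative assumptions and are not taken from the patent text.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins (bin count is an assumption)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)            # confidence of the predicted class
    pred = probs.argmax(axis=1)         # predicted class index
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Gap between accuracy and mean confidence, weighted by bin size.
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Five predictions at confidence 0.8, four of them correct: well calibrated.
calibrated = expected_calibration_error(
    np.array([[0.8, 0.2]] * 5), np.array([0, 0, 0, 0, 1])
)
```

An overconfident model, for example one that reports 0.9 confidence but is right only half the time, yields a correspondingly large ECE under the same function.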
[0005] Specifically, cache-based TTA methods use fixed hyperparameters to aggregate the raw CLIP prediction scores with the positive and negative cache prediction scores, and they face a "cold start" problem when the cache is empty during the initial inference phase. Together, these lead to severe overconfidence: the model's high-confidence predictions become unreliable, which poses a significant risk in high-stakes fields such as medical diagnosis and autonomous driving. Effective solutions to the test-time calibration problem of cache-based TTA models are currently lacking.
Summary of the Invention
[0006] In view of the problems in the prior art, the purpose of this invention is to provide a calibration method for test-time adaptation of visual language models based on one-time preparation, which significantly reduces calibration error and improves the reliability of model predictions while maintaining inference efficiency.
[0007] To achieve the above objective, the present invention adopts the following technical solution: a calibration method for test-time adaptation of visual language models based on one-time preparation, comprising the following steps:
Step 1: Synthetic data generation. For each category of the target data, synthetic images are generated using a text-to-image generative model.
Step 2: Feature extraction and dataset partitioning. The image encoder of the visual language model extracts the feature vectors of all synthetic images to construct a synthetic feature set, which is divided into two disjoint subsets: a pre-fill set and a calibration set.
Step 3: Cache pre-filling. The pre-fill set and its corresponding labels are used to initialize and fill the positive and negative sample caches of the training-free dynamic adaptation (TDA) method.
Step 4: Construction and training of a lightweight probe network. A lightweight probe network is constructed whose input is the concatenation of the prediction scores of the original visual language model, the prediction scores retrieved from the positive sample cache, and the prediction scores retrieved from the negative sample cache, and whose output is the dynamic calibration weights. The network is trained on the calibration set with the objective of minimizing the calibration error between the predicted probabilities and the synthetic labels.
Step 5: Dynamically calibrated inference at test time. During the test phase, the following steps are performed for each input test image:
Step 5.1: Extract the image features.
Step 5.2: Compute the original CLIP prediction.
Step 5.3: Compute the positive sample cache prediction and the negative sample cache prediction from the pre-filled cache.
Step 5.4: Concatenate the three predictions and input them into the trained lightweight probe network to obtain the dynamic weights.
Step 5.5: Compute the final calibrated prediction from the three predictions and the dynamic weights, then apply Softmax to output the classification probabilities.
[0008] In step 4, the lightweight probe network is a lightweight multilayer perceptron.
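[0009] In step 4, the training loss function of the lightweight probe network is the Brier Score: $\mathcal{L}_{\mathrm{Brier}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\left(p_{i,k} - y_{i,k}\right)^{2}$, where $N$ is the number of calibration samples, $K$ is the number of classes, $p_{i,k}$ is the model's predicted probability that sample $i$ belongs to class $k$, and $y_{i,k}$ is the true label of the synthetic data.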
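[0010] In step 5.5, the calibrated prediction is obtained by combining the original CLIP prediction with the positive and negative sample cache predictions, weighted by the dynamic weights output by the lightweight probe network.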
[0011] The method further includes Step 5.6: Cache update. After the prediction results are obtained, high-confidence samples are selected using a confidence threshold. Following the conventional training-free dynamic adaptation strategy, the features of these real test samples are added to the cache, gradually replacing or supplementing the synthetic data so that the cache adapts further to the real distribution.
[0012] With the above scheme, this invention first applies synthetic data pre-filling: a generative model generates a very small number of images offline for each category in the target domain, and these images are used to initialize the cache. This removes the noise of the "cold start" stage and gives the model a reliable initial reference. Second, a lightweight probe network is designed. By learning to predict the optimal fusion weights dynamically, it adaptively balances the original predictions against the cached evidence, replacing the original fixed weights and bringing the prediction confidence closer to the actual accuracy. The entire process is designed for efficiency: all additional operations (data generation and network training) are completed in a one-time offline preparation and consume no online time, and online inference adds only one small forward pass, which has almost no impact on speed. This invention significantly reduces calibration error while maintaining the original high inference efficiency, improving accuracy, calibration, and speed together and enhancing the predictive reliability of the model in real-time, high-risk scenarios.
Description of the Drawings
[0013] Figure 1 is a block diagram illustrating the principle of prior art 1; Figure 2 is a block diagram illustrating the principle of prior art 2; Figure 3 is a schematic diagram of the principle of the present invention; Figure 4 illustrates the sensitivity of the method of the present invention to the amount of synthetic data.
Detailed Description
[0014] As shown in Figure 3, this invention discloses a calibration method for test-time adaptation of visual language models based on one-time preparation, which improves calibration by introducing an offline preparation phase with two components. First, synthetic data generation uses a diffusion model to generate data and pre-fill the cache, addressing the cold start problem. Second, a lightweight probe network is trained on this synthetic data to dynamically predict the calibration weights used to combine the logits during test-time inference.
[0015] The method of the present invention specifically includes the following steps: Step 1: Synthetic Cache Pre-filling (SCP).
[0016] For each category of the target data, a small number of synthetic images are generated using a text-to-image generative model (such as Stable Diffusion). Specifically, a prompt template (e.g., "a photo of a [CLASS]") is filled in with the category name and input into the generative model to obtain a set of synthetic images.
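As a small illustration of the prompt construction described above, the sketch below fills the template with each category name. The helper name `build_prompts` and the example class names are hypothetical; the call to the text-to-image model is omitted because it depends on the generator chosen.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Return one text prompt per class name (template taken from the example above)."""
    return {name: template.format(name) for name in class_names}

# Hypothetical class names, for illustration only.
prompts = build_prompts(["golden retriever", "tabby cat"])
```

Each prompt would then be passed to the generative model a few times to obtain the small per-class image set.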
[0017] Step 2: Feature extraction and dataset partitioning.
[0018] The image encoder of the visual language model (such as CLIP) extracts feature vectors from all synthetic images to construct a synthetic feature set. The synthetic feature set is divided into two disjoint subsets: a pre-fill set and a calibration set.
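The partition into disjoint subsets can be sketched as follows. The per-class (stratified) split, the 50/50 ratio, and the name `split_indices` are assumptions made for illustration; the text only requires that the pre-fill set and the calibration set do not overlap. Random vectors stand in for encoder features.

```python
import numpy as np

def split_indices(labels, calib_fraction=0.5, seed=0):
    """Split sample indices per class into disjoint pre-fill and calibration sets."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    prefill_idx, calib_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_calib = int(round(len(idx) * calib_fraction))
        calib_idx.extend(idx[:n_calib].tolist())
        prefill_idx.extend(idx[n_calib:].tolist())
    return np.array(prefill_idx, dtype=int), np.array(calib_idx, dtype=int)

# Stand-in for encoder features: 8 synthetic images, 2 classes, 4-dim features.
features = np.random.default_rng(1).normal(size=(8, 4))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
prefill_idx, calib_idx = split_indices(labels)
prefill_feats, calib_feats = features[prefill_idx], features[calib_idx]
```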
[0019] Step 3: Cache pre-population.
[0020] The pre-fill set and its corresponding labels are used to initialize and fill the positive and negative sample caches of the training-free dynamic adaptation (TDA) method. This ensures that the caches hold dense reference features from the very beginning of inference, giving the model a "warm start" and avoiding the retrieval noise caused by a sparse cache in the initial stage.
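A minimal sketch of pre-filling a key-value cache and querying it is given below. It covers only a positive-style cache; the per-class capacity, the exponential affinity with a `sharpness` parameter (in the spirit of Tip-Adapter and TDA), and the function names are assumptions, and the handling of the negative cache is not specified in this text.

```python
import numpy as np

def prefill_cache(features, labels, num_classes, capacity=3):
    """Keys are L2-normalised features; values are one-hot labels."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    keys, values = [], []
    for c in range(num_classes):
        for f in features[labels == c][:capacity]:   # at most `capacity` per class
            keys.append(f / np.linalg.norm(f))
            values.append(np.eye(num_classes)[c])
    return np.stack(keys), np.stack(values)

def cache_logits(query, keys, values, sharpness=5.0):
    """Per-class cache score for one L2-normalised query feature."""
    affinity = keys @ query                           # cosine similarity to each key
    weights = np.exp(-sharpness * (1.0 - affinity))   # closer keys weigh more
    return weights @ values

keys, values = prefill_cache([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]], [0, 1, 0], num_classes=2)
scores = cache_logits(np.array([1.0, 0.0]), keys, values)
```

Because the cache is populated before the first test image arrives, the first query already retrieves class evidence instead of an empty result.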
[0021] Step 4: Construction and training of a lightweight probe network.
[0022] A lightweight multilayer perceptron, referred to as the lightweight probe network (LPN), is constructed. Its input is the concatenation of the prediction scores of the original visual language model, the prediction scores retrieved from the positive sample cache, and the prediction scores retrieved from the negative sample cache; its output is the dynamic calibration weights.
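The forward pass of such a probe network might look like the sketch below: a two-layer perceptron that maps the concatenated scores to two weights. The hidden width, the ReLU activation, the softplus output that keeps the weights positive, and the names `init_probe` and `probe_forward` are all assumptions; training is omitted.

```python
import numpy as np

def init_probe(num_classes, hidden=16, seed=0):
    """Randomly initialise a two-layer MLP taking 3 * num_classes inputs."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (3 * num_classes, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, 2)),
        "b2": np.zeros(2),
    }

def probe_forward(params, z_clip, z_pos, z_neg):
    """Map concatenated scores to two positive fusion weights."""
    x = np.concatenate([z_clip, z_pos, z_neg], axis=-1)
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)   # ReLU hidden layer
    raw = h @ params["W2"] + params["b2"]
    return np.logaddexp(0.0, raw)   # softplus: an assumed way to keep weights > 0

params = init_probe(num_classes=3)
weights = probe_forward(params, np.ones(3), np.zeros(3), np.zeros(3))
```

With only two small matrix products per image, the added inference cost is one cheap forward pass.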
[0023] The lightweight probe network is trained on the calibration set. The training objective is to minimize the calibration error between the predicted probabilities and the synthetic labels, so that the network learns to predict the optimal combination weights dynamically from the distribution of its input scores.
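[0024] In this embodiment, the training loss function is the Brier Score: $\mathcal{L}_{\mathrm{Brier}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\left(p_{i,k} - y_{i,k}\right)^{2}$, where $N$ is the number of calibration samples, $K$ is the number of classes, $p_{i,k}$ is the model's predicted probability that sample $i$ belongs to class $k$, and $y_{i,k}$ is the true (one-hot encoded) label of the synthetic data.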
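For reference, a minimal sketch of the multi-class Brier Score follows. It uses the standard form (squared error summed over classes and averaged over samples); the exact normalisation used in the patent is not legible in this text, so that choice is an assumption.

```python
import numpy as np

def brier_score(probs, labels_onehot):
    """Mean over samples of the squared error summed over classes."""
    probs = np.asarray(probs, dtype=float)
    labels_onehot = np.asarray(labels_onehot, dtype=float)
    return float(np.mean(np.sum((probs - labels_onehot) ** 2, axis=1)))

perfect = brier_score([[1.0, 0.0]], [[1, 0]])     # confident and correct
uncertain = brier_score([[0.5, 0.5]], [[1, 0]])   # maximally unsure on two classes
```

Unlike plain accuracy, this loss penalises confident mistakes more heavily than hesitant ones, which is why it suits a calibration objective.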
[0025] Step 5: Dynamically calibrate the inference during testing.
[0026] During the test phase, the following steps are performed for each input test image:
Step 5.1: Extract the image features.
Step 5.2: Compute the original CLIP prediction.
Step 5.3: Compute the positive sample cache prediction and the negative sample cache prediction from the pre-filled cache.
Step 5.4: Concatenate the three predictions and input them into the trained lightweight probe network to obtain the dynamic weights.
Step 5.5: Compute the final calibrated prediction from the three predictions and the dynamic weights, then apply Softmax to output the classification probabilities.
[0027]
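The combination in step 5.5 can be sketched as below. The formula itself is not legible in this text, so the additive form used here, which adds positive-cache evidence and subtracts negative-cache evidence in the manner of cache-based methods such as TDA, is an assumption, as are the names `calibrated_predict`, `alpha`, and `beta`.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_predict(z_clip, z_pos, z_neg, alpha, beta):
    """ASSUMED form: z_clip + alpha * z_pos - beta * z_neg, followed by softmax."""
    z_final = z_clip + alpha * z_pos - beta * z_neg
    return softmax(z_final)

z_clip = np.array([2.0, 1.0, 0.0])
# With both weights at zero the result reduces to the plain CLIP prediction.
baseline = calibrated_predict(z_clip, np.zeros(3), np.zeros(3), 0.0, 0.0)
# Positive-cache evidence for class 2 shifts the prediction towards it.
shifted = calibrated_predict(z_clip, np.array([0.0, 0.0, 3.0]), np.zeros(3), 1.0, 0.0)
```

In the described method the two weights would come from the probe network for each test image rather than being fixed hyperparameters.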
[0028] Step 5.6: Cache update.
[0029] After the prediction results are obtained, high-confidence samples are selected using a confidence threshold. Following the conventional training-free dynamic adaptation strategy, the features of these real test samples are added to the cache, gradually replacing or supplementing the synthetic data so that the cache adapts further to the real distribution.
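A minimal sketch of the confidence-gated update follows. The threshold value of 0.9, the append-only behaviour, and the name `maybe_update_cache` are assumptions; an eviction policy that retires synthetic entries would be needed in practice and is not shown.

```python
import numpy as np

def maybe_update_cache(keys, values, feature, probs, threshold=0.9):
    """Append the test feature with its pseudo-label if the prediction is confident."""
    if probs.max() < threshold:
        return keys, values, False                 # low confidence: cache unchanged
    pseudo = np.zeros_like(probs)
    pseudo[probs.argmax()] = 1.0                   # one-hot pseudo-label
    key = feature / np.linalg.norm(feature)        # store L2-normalised feature
    return np.vstack([keys, key]), np.vstack([values, pseudo]), True

keys = np.array([[1.0, 0.0]])
values = np.array([[1.0, 0.0]])
new_keys, new_values, added = maybe_update_cache(
    keys, values, np.array([0.0, 2.0]), np.array([0.05, 0.95])
)
```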
[0030] In summary, by introducing synthetic data pre-filling and dynamic weight prediction, this invention addresses both the cold-start noise and the overconfidence caused by static weights. The one-time preparation phase is performed offline and requires only a very small amount of synthetic data, such as a few generated images per class. During online inference, the computational overhead introduced by the lightweight probe network (LPN) is minimal and adds almost no inference latency, thus meeting the requirements of real-time applications.
[0031] Furthermore, the framework proposed in this invention can be integrated as a standalone module into various existing cache-based TTA methods (such as TDA and BoostAdapter) without modifying their core architecture. The preparation phase relies entirely on synthetic data produced by the generative model and requires no real samples from the target domain. This matters for protecting data privacy and for scenarios where data cannot be obtained in advance.
[0032] To verify the effects achieved by this invention, corresponding verification experiments were conducted on 15 benchmark datasets covering two major categories: fine-grained classification and natural distribution shift, as shown in Tables 1 and 2.
[0033] Table 1 compares the calibration performance of the TDA and BoostAdapter methods integrated with this invention against the original methods on fine-grained classification datasets, using the CLIP-B/16 (top) and CLIP-RN50 (bottom) backbone networks. ECE denotes the expected calibration error, which quantifies model calibration performance.
[0034]
[0035] Table 2 compares the calibration performance of the TDA and BoostAdapter methods integrated with this invention against the original methods on natural distribution shift datasets, using the CLIP-B/16 (top) and CLIP-RN50 (bottom) backbone networks. ECE denotes the expected calibration error, which quantifies model calibration performance.
[0036]
[0037] Table 1 shows the comparison of calibration error and accuracy between the proposed method and existing techniques on 11 fine-grained classification datasets, including Food101 and Oxford Pets, under different CLIP backbone networks. Table 2 shows the corresponding comparison on four natural distribution shift datasets, including ImageNet-A and ImageNet-R, under different backbone networks. The proposed scheme significantly reduces calibration error while maintaining stable classification accuracy under different data distribution conditions.
[0038] Figure 4 shows how the number of synthetic images used in the One-Time Preparation (OTP) phase affects the model calibration error on the ImageNet dataset. The experimental data show that the present invention needs only about 70 synthetic images to reduce the ECE to 1.54%, and the preparation process takes only about 30 seconds. Compared with the TDA method's total inference time of about 16 minutes on the ImageNet validation set, this one-time overhead amounts to about 3% and is incurred offline, showing that the method improves calibration performance while preserving the efficiency of the original test-time adaptation method.
[0039] The above description is merely an embodiment of the present invention and does not constitute any limitation on the technical scope of the present invention. Therefore, any minor modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention shall still fall within the scope of the technical solution of the present invention.
Claims
1. A calibration method for test-time adaptation of visual language models based on one-time preparation, characterized in that it comprises the following steps:
Step 1: Synthetic data generation. For each category of the target data, synthetic images are generated using a text-to-image generative model.
Step 2: Feature extraction and dataset partitioning. The image encoder of the visual language model extracts the feature vectors of all synthetic images to construct a synthetic feature set, which is divided into two disjoint subsets: a pre-fill set and a calibration set.
Step 3: Cache pre-filling. The pre-fill set and its corresponding labels are used to initialize and fill the positive and negative sample caches of the training-free dynamic adaptation (TDA) method.
Step 4: Construction and training of a lightweight probe network. A lightweight probe network is constructed whose input is the concatenation of the prediction scores of the original visual language model, the prediction scores retrieved from the positive sample cache, and the prediction scores retrieved from the negative sample cache, and whose output is the dynamic calibration weights. The network is trained on the calibration set with the objective of minimizing the calibration error between the predicted probabilities and the synthetic labels.
Step 5: Dynamically calibrated inference at test time. During the test phase, the following steps are performed for each input test image:
Step 5.1: Extract the image features.
Step 5.2: Compute the original CLIP prediction.
Step 5.3: Compute the positive sample cache prediction and the negative sample cache prediction from the pre-filled cache.
Step 5.4: Concatenate the three predictions and input them into the trained lightweight probe network to obtain the dynamic weights.
Step 5.5: Compute the final calibrated prediction from the three predictions and the dynamic weights, then apply Softmax to output the classification probabilities.
2. The calibration method for test-time adaptation of visual language models based on one-time preparation according to claim 1, characterized in that, in step 4, the lightweight probe network is a lightweight multilayer perceptron.
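3. The calibration method for test-time adaptation of visual language models based on one-time preparation according to claim 1, characterized in that, in step 4, the training loss function of the lightweight probe network is the Brier Score: $\mathcal{L}_{\mathrm{Brier}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\left(p_{i,k} - y_{i,k}\right)^{2}$, where $N$ is the number of calibration samples, $K$ is the number of classes, $p_{i,k}$ is the model's predicted probability that sample $i$ belongs to class $k$, and $y_{i,k}$ is the true label of the synthetic data.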
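4. The calibration method for test-time adaptation of visual language models based on one-time preparation according to claim 1, characterized in that, in step 5.5, the calibrated prediction is obtained by combining the original CLIP prediction with the positive and negative sample cache predictions, weighted by the dynamic weights output by the lightweight probe network.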
5. The calibration method for test-time adaptation of visual language models based on one-time preparation according to claim 1, characterized in that the method further includes Step 5.6: Cache update. After the prediction results are obtained, high-confidence samples are selected using a confidence threshold. Following the conventional training-free dynamic adaptation strategy, the features of these real test samples are added to the cache, gradually replacing or supplementing the synthetic data so that the cache adapts further to the real distribution.