A visual language model alignment method based on spatial semantic reconstruction

By using a decoupled reconstruction mechanism and the interaction of learnable tokens, the modal asymmetry and computational bottleneck problems of visual language models in open-set object detection are solved, achieving efficient and robust image-text alignment and detection capabilities.

CN122244466A · Pending · Publication Date: 2026-06-19 · CHENGDU LINGSU INTELLIGENT TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHENGDU LINGSU INTELLIGENT TECHNOLOGY CO LTD
Filing Date
2026-03-26
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing visual language models face modal asymmetry and computational bottlenecks in open-set object detection, as well as the contradiction between training and generalization, making it difficult to learn cross-modal consistent representations with strong generalization capabilities in open-vocabulary scenarios.

Method used

By adopting a decoupled reconstruction mechanism, learnable tokens are introduced for cross-scale interaction, generating reconstructed visual sequences and calculating auxiliary spatial reconstruction loss, achieving semantic alignment between text and visual modalities, and integrating multiple loss methods to optimize model parameters and reduce dependence on labeled data.

Benefits of technology

It significantly reduces computational overhead, improves the model's detection capability in open set detection, and can dynamically expand detection capability under unsupervised conditions, achieving efficient and robust image-text alignment.



Abstract

This invention relates to the field of image recognition technology, and more particularly to a visual language model alignment method based on spatial semantic reconstruction. This method first extracts multi-scale features through a visual encoder, and then uses learnable tokens to interact with these multi-scale features across scales, progressively reconstructing the visual sequence from high-level semantics to low-level spatial representation. Subsequently, textual semantic embeddings are mapped to a shared space, and query and key-value sequences are generated through a parallel projector. These sequences are then bidirectionally cross-mapped with the reconstructed visual sequence, achieving cyclic semantic reconstruction between text and visual modalities. During the training phase, spatial reconstruction loss, semantic reconstruction loss, and implicit contrast alignment loss are jointly optimized to enhance intermodal consistency. During the inference phase, reparameterization techniques are used to statically fuse textual semantic features into the visual branch, causing the model to degenerate into a single-modal structure. This invention effectively solves the problems of high computational overhead in cross-modal interaction and imbalanced semantic alignment.

Description

Technical Field

[0001] This invention relates to the field of image recognition technology, and in particular to a visual language model alignment method based on spatial semantic reconstruction.

Background Technology

[0002] Visual-language models (VLMs) are a core research direction in current multimodal artificial intelligence. Their goal is to achieve accurate alignment of image features and natural language semantics in a unified representation space through the collaboration of visual encoders and language encoders. Unlike large multimodal models (MLLMs) that focus on general reasoning, VLMs place greater emphasis on spatial localization accuracy, task-specific structured outputs (such as bounding boxes and segmentation masks), and engineered inference efficiency.

[0003] However, in complex tasks such as open-set detection, existing technologies face two significant challenges. The first is modal asymmetry and the resulting computational bottleneck. Visual inputs are highly redundant and spatially dense, while text descriptions are highly abstract and sparse. Performing global cross-modal interaction directly causes the computational overhead to grow quadratically with resolution, and it easily leads to "intramodal attention dilution," in which the semantic guidance of the text is overwhelmed by the large number of visual tokens. The second is the contradiction between training and generalization. Traditional methods rely heavily on large-scale supervised annotation or simple contrastive learning (such as CLIP). In open-vocabulary scenarios, however, category boundaries are blurred and positive and negative samples are difficult to construct, so the model struggles to learn cross-modal consistent representations that generalize well, and a semantic discrepancy remains between the pre-trained feature space and the downstream detection task.

Summary of the Invention

[0004] To overcome the above shortcomings, this invention provides a visual language model alignment method based on spatial semantic reconstruction, aiming to achieve efficient and robust image-text alignment through a decoupled reconstruction mechanism. Without directly performing coarse-grained cross-modal interactions, "reconstruction" serves as a supervisory signal, guiding the model to learn both spatial and semantic consistency in the latent space.

[0005] This invention provides the following technical solution: a visual language model alignment method based on spatial semantic reconstruction, comprising:

[0006] S1. Obtain multi-scale visual features of the input image through a visual encoder, and obtain semantic embeddings of the text description through a text encoder;

[0007] S2. Introduce a learnable token as the initial query quantity, and perform cross-scale interaction with the multi-scale visual features. Reconstruct the visual sequence step by step from the high-level semantic features to the low-level spatial features, generate the reconstructed visual sequence and calculate the auxiliary spatial reconstruction loss.

[0008] S3. Map the semantic embedding to the shared embedding space and perform bidirectional cross-mapping with the reconstructed visual sequence to achieve semantic alignment between text and visual modalities, and calculate the auxiliary semantic reconstruction loss and spatial semantic alignment loss.

[0009] S4. The auxiliary spatial reconstruction loss, auxiliary semantic reconstruction loss, spatial semantic alignment loss and the prediction loss of the downstream task are fused and optimized to achieve high-dimensional alignment between visual spatial details and text semantics.

[0010] S5. During the model inference stage, based on the trained semantic alignment parameters, the semantic features of the text description are statically fused into the visual processing branch, so that the model degenerates into a single-modal visual structure when performing prediction tasks, thereby accelerating computation.

[0011] Preferably, the visual encoder uses the DINOv3 ConvNeXt model; the text encoder is a pre-trained CLIP or BERT model.

[0012] Preferably, the multi-scale visual features include three feature maps with different resolutions, which are defined as a first resolution feature map, a second resolution feature map, and a third resolution feature map, respectively; wherein the spatial resolution of the feature maps decreases progressively from the first to the third resolution feature map.

[0013] Preferably, the steps of generating the reconstructed visual sequence and calculating the spatial reconstruction loss include:

[0014] The first, second, and third resolution feature maps are processed into tokens to generate first, second, and third visual token sequences respectively.

[0015] The learnable token and the third visual token sequence are concatenated and then input into the self-attention module for processing. Only the self-attention output result corresponding to the learnable token is retained as the first-stage query vector.

[0016] The first-stage query vector and the second visual token sequence are subjected to cosine cross attention processing to output intermediate reconstructed features.

[0017] The intermediate reconstructed features are used as query values and the first visual token sequence is used as key-value pairs for causal cross-attention processing to obtain the reconstructed visual sequence.

[0018] Based on the reconstructed visual sequence and the original feature sequence, the auxiliary spatial reconstruction loss is calculated.

[0019] Preferably, the steps for achieving semantic alignment between text and visual modalities include:

[0020] The semantic embedding of the text description is mapped onto two parallel linear projectors to generate a text query sequence and a text key-value sequence;

[0021] The text query sequence and the reconstructed visual sequence are subjected to cosine cross attention processing to generate a visual token representation that is aligned with the text semantics.

[0022] After performing causal cross-attention processing on the text key-value sequence and the reconstructed visual sequence, the reconstructed text query sequence is output.

[0023] Preferably, the steps for calculating the semantic reconstruction loss and alignment loss include:

[0024] The auxiliary semantic reconstruction loss is calculated based on the consistency difference between the reconstructed text query sequence and the original text query sequence;

[0025] The spatial semantic alignment loss is calculated based on the visual token representation and the reconstructed visual sequence.

[0026] Preferably, the step of fusing and optimizing the predicted loss with that of downstream tasks includes:

[0027] The visual token representation is input into the contrastive learning head for nonlinear transformation to generate a projection semantic vector that enhances discriminativeness;

[0028] The projected semantic vector and the multi-scale visual features are subjected to implicit contrast learning at different levels to narrow the semantic distance between the semantic vector and the visual features at each scale.

[0029] The projected semantic vectors are concatenated and fused with the original visual features at each scale to construct a multi-scale fused feature sequence.

[0030] The multi-scale fused feature sequence is input into the task head, and the model parameters are optimized by jointly iterating the output prediction loss, auxiliary spatial reconstruction loss, auxiliary semantic reconstruction loss, and spatial semantic alignment loss.

[0031] Preferably, the prediction loss is fused with the auxiliary spatial reconstruction loss, the auxiliary semantic reconstruction loss, and the spatial semantic alignment loss by a weighted summation.
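The weighted summation described above can be sketched as follows. This is only an illustration: the function name `total_loss` and the weights `w_spatial`, `w_semantic`, and `w_align` are hypothetical, and the text does not give their values or state whether some terms share one coefficient.

```python
def total_loss(pred_loss, spatial_rec_loss, semantic_rec_loss, align_loss,
               w_spatial=1.0, w_semantic=1.0, w_align=1.0):
    """Weighted sum of the prediction loss and the three auxiliary losses.

    The weights are hypothetical balancing coefficients; the source text
    only says that the losses are fused by weighted summation.
    """
    return (pred_loss
            + w_spatial * spatial_rec_loss
            + w_semantic * semantic_rec_loss
            + w_align * align_loss)


# Example with made-up loss values and weights.
example = total_loss(1.0, 2.0, 3.0, 4.0,
                     w_spatial=0.5, w_semantic=0.1, w_align=0.25)
```

In a real training loop the four inputs would be scalar tensors from an automatic differentiation framework, so that the gradient of the sum reaches all branches at once.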

[0032] The present invention has the following beneficial effects:

[0033] 1. In this invention, a strong supervised signal without external annotation is constructed through two reconstruction tasks: PMT-REC (Progressive Multi-Scale Token Reconstruction) and CCQA (Circular Consistent Query Alignment). PMT-REC uses original visual features as the supervision target, forcing the model to capture fine-grained spatial structure during the generation process; CCQA establishes semantic consistency constraints through bidirectional reconstruction of text and visual tokens. This "reconstruction as alignment" paradigm enables the model to learn cross-modal associations under unsupervised conditions, significantly reducing its dependence on labeled data.

[0034] 2. In this invention, the cross-modal fusion problem is decomposed into two independent reconstruction tasks, thus avoiding the computational complexity of global attention. PMT-REC transforms the high redundancy of visual features into a computational advantage in local interactions through multi-scale token reconstruction; CCQA, on the other hand, confines the alignment process between text and vision to a low-dimensional semantic space through cyclic consistency constraints, rather than directly manipulating high-dimensional token sequences. This decoupled design not only reduces computational overhead but also effectively alleviates intramodal attention dilution and modal saddle point problems through the synergistic effect of local reconstruction and semantic constraints.

[0035] 3. In this invention, PMT-REC enhances the model's perception of spatial structure by reconstructing visual tokens, enabling it to locate novel-class targets more accurately; CCQA, on the other hand, endows the model with the ability to generalize to linguistic descriptions through text-guided semantic reconstruction. During training, the model learns cross-modal "concept transfer" rules through the reconstruction tasks, which enables it to activate responses to unseen categories with minimal cues (such as textual descriptions) during the inference phase. This ability is particularly important in open-set detection because it allows the model to dynamically extend its detection capabilities through linguistic descriptions under unsupervised conditions.

Attached Figure Description

[0036] Figure 1 is a flowchart of the visual language model alignment method based on spatial semantic reconstruction proposed in this invention;

[0037] Figure 2 is a schematic diagram of the PMT-REC spatial reconstruction proposed in this invention;

[0038] Figure 3 is a schematic diagram of the CCQA semantic reconstruction and explicit alignment of spatial information proposed in this invention;

[0039] Figure 4 is a schematic diagram of the overall end-to-end prompt-learning training architecture incorporating spatial semantic reconstruction proposed in this invention;

[0040] Figure 5 is a schematic diagram of the inference stage process proposed in this invention, where (a) is before reparameterization and (b) is after reparameterization.

Detailed Implementation

[0041] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0042] Example 1

[0043] In a first embodiment, the present invention provides a visual language model alignment method based on spatial semantic reconstruction which, as shown in Figure 1, includes the following steps:

[0044] S1. Obtain multi-scale visual features of the input image through a visual encoder, and obtain semantic embeddings of the text description through a text encoder;

[0045] Preferably, the visual encoder uses the DINOv3 ConvNeXt model; the text encoder is a pre-trained CLIP or BERT model.

[0046] Preferably, the multi-scale visual features include three feature maps with different resolutions, which are defined as a first resolution feature map, a second resolution feature map, and a third resolution feature map, respectively; wherein the spatial resolution of the feature maps decreases progressively from the first to the third resolution feature map.

[0047] Specifically, as shown in Figure 2, this part is defined as the PMT-REC module, which uses DINOv3 ConvNeXt as the visual encoder and takes the extracted multi-level visual features as input. It achieves joint alignment of spatial and semantic information through progressive reconstruction of learnable tokens. As shown in Figure 2, the DINOv3 ConvNeXt Distilling model is applied to the original image $I$ to extract three feature maps at different resolutions:

[0048] $F_3 \in \mathbb{R}^{C_3 \times H_3 \times W_3}$;

[0049] $F_4 \in \mathbb{R}^{C_4 \times H_4 \times W_4}$;

[0050] $F_5 \in \mathbb{R}^{C_5 \times H_5 \times W_5}$;

[0051] where $H_i$ and $W_i$ are the height and width of feature map $F_i$, respectively, and the index $i$ increasing from 3 to 5 follows the downsampling process from shallow to deep layers.

[0052] S2. Introduce a learnable token as the initial query quantity, and perform cross-scale interaction with the multi-scale visual features. Reconstruct the visual sequence step by step from the high-level semantic features to the low-level spatial features, generate the reconstructed visual sequence and calculate the auxiliary spatial reconstruction loss.

[0053] Preferably, the steps of generating the reconstructed visual sequence and calculating the spatial reconstruction loss include:

[0054] The first, second, and third resolution feature maps are processed into tokens to generate first, second, and third visual token sequences respectively.

[0055] The learnable token and the third visual token sequence are concatenated and then input into the self-attention module for processing. Only the self-attention output result corresponding to the learnable token is retained as the first-stage query vector.

[0056] The first-stage query vector and the second visual token sequence are subjected to cosine cross attention processing to output intermediate reconstructed features.

[0057] The intermediate reconstructed features are used as query values and the first visual token sequence is used as key-value pairs for causal cross-attention processing to obtain the reconstructed visual sequence.

[0058] Based on the reconstructed visual sequence and the original feature sequence, the auxiliary spatial reconstruction loss is calculated.

[0059] Specifically, each feature map is then tokenized into a sequence:

[0060] $P_i = \mathrm{Tokenize}(F_i) \in \mathbb{R}^{N_i \times D}, \quad i \in \{3, 4, 5\}$;

[0061] where $N_i$ is the number of tokens (the height of the feature map times its width) and $D$ is the embedding dimension. A set of learnable tokens $T \in \mathbb{R}^{K \times D}$ (for example, K=64) is introduced, serving as a starting prompt or seed for the reconstruction process. These tokens are concatenated with the P5 feature tokens $P_5$ and input into the Self-Attention module to obtain the self-attention result:

[0062] $Z = \mathrm{SelfAttn}([T; P_5])$;

[0063] Then only the portion corresponding to the learnable tokens is retained, denoted as $T'$, which is used for subsequent cross-scale interactions.
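The first reconstruction stage can be illustrated with a minimal NumPy sketch. It assumes single-head scaled dot-product self-attention and random stand-ins for the learnable tokens, the lowest-resolution feature map, and the projection weights; the names `tokenize`, `self_attention`, and `stage1_query`, the embedding size 32, and the 7x7 map are hypothetical and not taken from the patent.

```python
import numpy as np


def tokenize(feature_map):
    """Flatten a (C, H, W) feature map into an (H*W, C) token sequence."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over an (N, D) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


rng = np.random.default_rng(0)
K, D = 64, 32                                  # K=64 as in the text; D is assumed
f5 = rng.normal(size=(D, 7, 7))                # stand-in lowest-resolution map
learnable_tokens = rng.normal(size=(K, D))     # stand-in for trained tokens
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

p5 = tokenize(f5)                                   # (49, D)
sequence = np.concatenate([learnable_tokens, p5])   # (K + 49, D)
attended = self_attention(sequence, w_q, w_k, w_v)
stage1_query = attended[:K]                         # keep only the token rows
```

Discarding the rows that belong to the visual tokens keeps the sequence passed to later stages at a fixed length K, regardless of image resolution.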

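[0064] The retained learnable-token output $T'$ is used as the query and the P4 feature tokens $P_4$ as the key-value pair to compute cosine cross-attention, denoted as:

[0065] $R_4 = \mathrm{CosCrossAttn}(T', P_4, P_4)$;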
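The text does not define cosine cross-attention further. A common reading, assumed here, is that queries and keys are L2-normalized so that the attention logits are cosine similarities scaled by a temperature. The function name `cosine_cross_attention`, the temperature `tau`, and the tensor sizes in this sketch are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_cross_attention(query, keys, values, tau=0.1):
    """Cross-attention whose logits are cosine similarities divided by tau.

    query: (Nq, D), keys/values: (Nk, D). Returns (output, weights).
    """
    logits = l2_normalize(query) @ l2_normalize(keys).T / tau
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
stage1_query = rng.normal(size=(64, 32))   # stand-in first-stage query
p4 = rng.normal(size=(196, 32))            # stand-in second token sequence
intermediate, attn = cosine_cross_attention(stage1_query, p4, p4)
```

Because only the direction of each vector matters, rescaling the query leaves the output unchanged, which makes the interaction insensitive to magnitude differences between feature levels.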

[0066] Further, the intermediate reconstructed features $R_4$ are used as the query and the P3 feature tokens $P_3$ as the key-value pair, with a causal constraint (causal mask) applied to achieve autoregressive reconstruction:

[0067] $R_3 = \mathrm{CausalCrossAttn}(R_4, P_3, P_3)$;

[0068] Here, the mask ensures that the $i$-th token can rely only on the tokens before it, so that the tokens are generated in a specific order.
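A causal mask in cross-attention can be sketched as below. The exact mask layout is not given in the text; this sketch assumes a lower-triangular mask in which query position i may attend to key positions 0 through i, and the function name `causal_cross_attention` is hypothetical.

```python
import numpy as np


def causal_cross_attention(query, keys, values):
    """Cross-attention with a lower-triangular (causal) mask.

    query: (Nq, D), keys/values: (Nk, D). Query i sees keys 0..i only
    (an assumed layout). Returns (output, weights).
    """
    n_q, n_k = query.shape[0], keys.shape[0]
    scores = query @ keys.T / np.sqrt(query.shape[-1])
    allowed = np.tril(np.ones((n_q, n_k), dtype=bool))
    scores = np.where(allowed, scores, -np.inf)      # masked keys get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # stand-in intermediate reconstructed features
kv = rng.normal(size=(6, 8))   # stand-in first visual token sequence
out, w = causal_cross_attention(q, kv, kv)
```

With this layout the first output token depends on the first key only, and each later token can draw on one more key than its predecessor.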

[0069] The final reconstructed P3 features are obtained by fusing two parts:

[0070] ;

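[0071] Here, the patch-wise up-sampled term is an approximate version of the original P3 features. Reconstruction quality is supervised using an MSE loss:

[0072] $\mathcal{L}_{\text{spa}} = \mathrm{MSE}(\hat{P}_3, P_3)$;

[0073] where $\hat{P}_3$ denotes the final reconstructed P3 features and $P_3$ the original P3 features. This loss function is called the auxiliary spatial reconstruction loss, and it is used to enhance the model's ability to model spatial details.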
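As a small illustration of the MSE supervision, the sketch below compares a reconstructed token sequence with the original one. It assumes both sequences already have the same shape; how the patent matches their lengths is not recoverable from the text, and the name `mse_loss` is hypothetical.

```python
import numpy as np


def mse_loss(reconstructed, original):
    """Mean squared error between two token sequences of equal shape."""
    reconstructed = np.asarray(reconstructed, dtype=float)
    original = np.asarray(original, dtype=float)
    if reconstructed.shape != original.shape:
        raise ValueError("sequences must have the same shape")
    return float(np.mean((reconstructed - original) ** 2))


# Two tokens with two channels each; the squared errors are 1, 9, 0, 0.
example_loss = mse_loss([[0.0, 0.0], [1.0, 1.0]], [[1.0, 3.0], [1.0, 1.0]])
```

Since the target is the model's own visual feature sequence, this supervision needs no external annotation.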

[0074] S3. Map the semantic embedding to the shared embedding space and perform bidirectional cross-mapping with the reconstructed visual sequence to achieve semantic alignment between text and visual modalities, and calculate the auxiliary semantic reconstruction loss and spatial semantic alignment loss.

[0075] Preferably, the steps for achieving semantic alignment between text and visual modalities include:

[0076] The semantic embedding of the text description is mapped onto two parallel linear projectors to generate a text query sequence and a text key-value sequence;

[0077] The text query sequence and the reconstructed visual sequence are subjected to cosine cross attention processing to generate a visual token representation that is aligned with the text semantics.

[0078] After performing causal cross-attention processing on the text key-value sequence and the reconstructed visual sequence, the reconstructed text query sequence is output.

[0079] Preferably, the steps for calculating the semantic reconstruction loss and alignment loss include:

[0080] The auxiliary semantic reconstruction loss is calculated based on the consistency difference between the reconstructed text query sequence and the original text query sequence;

[0081] The spatial semantic alignment loss is calculated based on the visual token representation and the reconstructed visual sequence.

[0082] Specifically, this part is defined as the CCQA module, which aims to achieve bidirectional semantic alignment between the visual and linguistic modalities through cycle-consistent query alignment. Its inputs are the reconstructed token sequence $\hat{V} \in \mathbb{R}^{K \times D}$ (e.g., K=64) generated by PMT-REC from the multi-scale features F3, F4, and F5 of the image, and the semantic embedding $E_t$ of the text description (e.g., "Hair, Building, Bag, Skirt...") after BERT / CLIP encoding.

[0083] As shown in Figure 3, $E_t$ is first mapped to a shared embedding space via two parallel linear projectors:

[0084] $Q_t = W_q E_t, \quad KV_t = W_{kv} E_t$;

[0085] where $W_q$ and $W_{kv}$ are learnable projection layers whose output dimension matches that of the visual tokens, and $Q_t$ and $KV_t$ serve as the query and the key-value pair, respectively. Subsequently, bidirectional cross-attention is performed to achieve semantic reconstruction.

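[0086] From text to visual: with the text query sequence $Q_t$ as the query and the reconstructed visual sequence $\hat{V}$ as the key-value pair, cosine cross-attention is computed:

[0087] $V_t = \mathrm{CosCrossAttn}(Q_t, \hat{V}, \hat{V})$;

[0088] This generates a visual token representation $V_t$ that is aligned with the semantics of the text.

[0089] From visual to text: with the reconstructed visual sequence $\hat{V}$ as the query and the text key-value sequence $KV_t$ as the key-value pair, a causal mask is applied to achieve autoregressive semantic generation:

[0090] $\hat{Q}_t = \mathrm{CausalCrossAttn}(\hat{V}, KV_t, KV_t)$;

[0091] The output $\hat{Q}_t$ is the reconstructed text query sequence.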
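The bidirectional mapping can be sketched end to end in NumPy. Several points here are assumptions rather than statements from the patent: the text sequence is given the same length K as the visual sequence so that the two query sequences can be compared, the visual sequence acts as the query in the visual-to-text direction, and all names (`attend`, `w_query`, `w_kv`, and so on), sizes, and random inputs are hypothetical.

```python
import numpy as np


def attend(query, keys, values, cosine=False, causal=False, tau=0.1):
    """Cross-attention; cosine=True uses cosine logits, causal=True masks keys j > i."""
    if cosine:
        query = query / np.linalg.norm(query, axis=-1, keepdims=True)
        keys = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
        scores = query @ keys.T / tau
    else:
        scores = query @ keys.T / np.sqrt(query.shape[-1])
    if causal:
        allowed = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ values


rng = np.random.default_rng(0)
K, D, D_TEXT = 64, 32, 48                    # sizes are assumptions
text_emb = rng.normal(size=(K, D_TEXT))      # stand-in text encoder output
visual_rec = rng.normal(size=(K, D))         # stand-in reconstructed visual sequence
w_query = rng.normal(size=(D_TEXT, D)) / np.sqrt(D_TEXT)   # projector 1
w_kv = rng.normal(size=(D_TEXT, D)) / np.sqrt(D_TEXT)      # projector 2

text_query = text_emb @ w_query              # text query sequence
text_kv = text_emb @ w_kv                    # text key-value sequence

# Text to visual: visual token representation aligned with the text.
aligned_visual = attend(text_query, visual_rec, visual_rec, cosine=True)
# Visual to text: reconstructed text query sequence under a causal mask.
text_query_rec = attend(visual_rec, text_kv, text_kv, causal=True)
# Consistency between reconstructed and original text queries.
semantic_rec_loss = float(np.mean((text_query_rec - text_query) ** 2))
```

The two attention passes run over short sequences of length K, which is the sense in which the alignment avoids attending over every visual token of every scale.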

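[0092] To enhance alignment consistency, CCQA introduces the following losses.

[0093] The reconstructed text query sequence $\hat{Q}_t$ is compared with the original $Q_t$ using MSE to construct the auxiliary semantic reconstruction loss:

[0094] $\mathcal{L}_{\text{sem}} = \mathrm{MSE}(\hat{Q}_t, Q_t)$;

[0095] This is called the auxiliary semantic reconstruction loss. At the same time, a contrastive loss supervises the semantic consistency between the visual token representation $V_t$ and the original reconstructed visual sequence $\hat{V}$, defined as:

[0096] $\mathcal{L}_{\text{align}} = -\log \dfrac{\exp(\mathrm{sim}(V_t, \hat{V}) / \tau)}{\exp(\mathrm{sim}(V_t, \hat{V}) / \tau) + \sum_{n \in \mathcal{N}} \exp(\mathrm{sim}(V_t, n) / \tau)}$;

[0097] where $\mathrm{sim}(\cdot, \cdot)$ is a similarity function, $\tau$ is the temperature parameter, and $\mathcal{N}$ is the negative sample set. This is the spatial semantic alignment loss. Overall, the CCQA module achieves precise alignment of visual and linguistic features at the token level through bidirectional mapping and cyclic consistency constraints, significantly enhancing semantic consistency and robustness in image-text prompt learning.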
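A contrastive loss with a temperature and a negative sample set is commonly written in the InfoNCE form, which is what the sketch below assumes; the patent's exact formula is not recoverable from the text. The function name `info_nce`, cosine similarity as the similarity function, and the default temperature 0.07 are hypothetical choices.

```python
import numpy as np


def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor vector.

    anchor, positive: (D,) vectors; negatives: (M, D) matrix.
    Uses cosine similarity divided by the temperature tau.
    """
    def cos(a, b):
        return (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b, axis=-1))

    pos = cos(anchor, positive[None, :])        # shape (1,)
    neg = cos(anchor, negatives)                # shape (M,)
    logits = np.concatenate([pos, neg]) / tau
    logits = logits - logits.max()              # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))


a = np.array([1.0, 0.0, 0.0])
# Positive identical to the anchor, negative pointing the other way: low loss.
loss_good = info_nce(a, a, -a[None, :])
# Positive pointing away, negative identical to the anchor: high loss.
loss_bad = info_nce(a, -a, a[None, :])
# Positive and the single negative equally similar: loss is log(2).
loss_tie = info_nce(a, a, a[None, :])
```

A smaller temperature sharpens the distribution over the positive and the negatives, so mismatched pairs are penalized more strongly.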

[0098] S4. The auxiliary spatial reconstruction loss, auxiliary semantic reconstruction loss, spatial semantic alignment loss and the prediction loss of the downstream task are fused and optimized to achieve high-dimensional alignment between visual spatial details and text semantics.

[0099] Preferably, the step of fusing and optimizing the predicted loss with that of downstream tasks includes:

[0100] The visual token representation is input into the contrastive learning head for nonlinear transformation to generate a projection semantic vector that enhances discriminativeness;

[0101] The projected semantic vector and the multi-scale visual features are subjected to implicit contrast learning at different levels to narrow the semantic distance between the semantic vector and the visual features at each scale.

[0102] The projected semantic vectors are concatenated and fused with the original visual features at each scale to construct a multi-scale fused feature sequence.

[0103] The multi-scale fused feature sequence is input into the task head, and the model parameters are optimized by jointly iterating the output prediction loss, auxiliary spatial reconstruction loss, auxiliary semantic reconstruction loss, and spatial semantic alignment loss.

[0104] Preferably, the prediction loss is fused with the auxiliary spatial reconstruction loss, the auxiliary semantic reconstruction loss, and the spatial semantic alignment loss by a weighted summation.

[0105] Specifically, during the training phase, deep collaborative modeling of image and text information is achieved by jointly optimizing the spatial reconstruction, semantic alignment, and downstream task losses. The overall process is shown in Figure 4.

[0106] First, an image $I$ is input. Multi-scale features F3, F4, and F5 are extracted via DINOv3 ConvNeXt Distilling and then fed into the PMT-REC module for spatial reconstruction, generating a reconstructed visual token sequence $\hat{V}$ while the auxiliary spatial reconstruction loss $\mathcal{L}_{\text{spa}}$ is calculated. Meanwhile, text descriptions (such as "Hair, Building, Bag, Skirt...") are encoded using BERT / CLIP, mapped to a shared embedding space through two independent linear projectors, and then input to the CCQA module. CCQA outputs two key results:

[0107] The reconstructed semantics $V_t$, which integrate the spatial structure of the image with the textual semantics;

[0108] The auxiliary semantic reconstruction loss $\mathcal{L}_{\text{sem}}$ and the spatial semantic alignment loss $\mathcal{L}_{\text{align}}$.

[0109] Subsequently, $V_t$ is fed into a contrastive task head, which applies a nonlinear transformation to enhance its discriminative power:

[0110] $S = \mathrm{ContrastiveHead}(V_t)$;

[0111] Next, implicit contrastive alignment is performed between the semantic tokens $S$ and the original multi-scale visual features F3, F4, and F5 at the corresponding levels. This contrastive learning narrows the semantic distance between the semantic tokens and the visual features, thereby achieving cross-modal knowledge transfer. The aligned features are then concatenated with the original features:

[0112] $F_i' = \mathrm{Concat}(F_i, S_i), \quad i \in \{3, 4, 5\}$;

[0113] where $S_i$ is $S$ adapted to the dimensions of level $i$ through learnable projection or repeated expansion. Finally, the fused features $F_3'$, $F_4'$, and $F_5'$ are input into a standard detection-segmentation task head to perform object detection and semantic segmentation, and the corresponding supervised loss $\mathcal{L}_{\text{task}}$ is calculated.

[0114] The total loss function is the sum of three parts:

[0115] ;

[0116] where the balancing coefficients weight the individual loss terms. This training paradigm implements a "dual reconstruction" strategy: explicit reconstruction (PMT-REC) enhances spatial awareness, and implicit alignment (contrastive head plus concatenation) improves semantic consistency, ultimately improving the performance of image-text prompt learning on detection and segmentation tasks.

[0117] S5. During the model inference stage, based on the trained semantic alignment parameters, the semantic features of the text description are statically fused into the visual processing branch, so that the model degenerates into a single-modal visual structure when performing prediction tasks, thereby accelerating computation.

[0118] Example 2

[0119] One of the core design goals of this invention is to achieve a balance between "strong expressiveness during training and high efficiency during inference." To this end, we propose a two-stage inference mechanism: the first stage achieves non-intrusive image-text fusion through semantic approximation, preserving flexibility; the second stage utilizes the static characteristics of text prompts to complete the structural degradation from a multimodal model to a pure visual model, achieving ultimate inference performance. These two stages together constitute a complete closed loop from research to implementation of DSSR.

[0120] Phase 1: Non-intrusive prompt fusion (before reparameterization)

[0121] In standard inference scenarios, the PMT-REC and CCQA modules are frozen and removed from the forward computation graph, so they no longer participate in inference. The key to this design is that, during the training phase, the auxiliary semantic reconstruction loss and the spatial semantic alignment loss enforce a high degree of consistency between the semantic tokens reconstructed under visual guidance and the semantics of the original text. Therefore, during inference, the query generated by projecting the original text encoding can safely be used as a high-quality approximation of the fused semantic information.

[0122] The specific process is shown in Figure 5(a). Multi-scale features F3, F4, and F5 are extracted from the input image using the DINOv3 ConvNeXt Distilling backbone. Simultaneously, predefined text prompts (such as "Hair, Building, Bag, Skirt...") are encoded into semantic embeddings using BERT or CLIP, and these are used to generate queries via two independent linear projection layers. The query is fed into a lightweight contrastive head (typically a two-layer MLP) for semantic enhancement and spatial adaptation, outputting the semantic vector $S$.

[0123] Subsequently, $S$ is concatenated across modalities with the visual features F3, F4, and F5 at each level:

[0124] $F_i' = \mathrm{Concat}(F_i, \mathrm{Broadcast}(S)), \quad i \in \{3, 4, 5\}$;

[0125] where $\mathrm{Broadcast}(S)$ denotes $S$ broadcast to the spatial dimensions of the corresponding feature map, which ensures channel alignment. Finally, the fused features are fed into a standard detection-segmentation task head (such as a traditional YOLO, Mask2Former, or DETR decoder) to complete the dense prediction task.
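The broadcast-and-concatenate fusion can be sketched as follows. The sketch assumes the semantic information has been reduced to a single channel vector per image, which the text does not state; the function name `fuse` and all sizes are hypothetical.

```python
import numpy as np


def fuse(feature_map, semantic_vec):
    """Concatenate a semantic vector to every spatial location of a feature map.

    feature_map: (C, H, W); semantic_vec: (C_s,). Returns (C + C_s, H, W).
    """
    _, h, w = feature_map.shape
    tiled = np.broadcast_to(semantic_vec[:, None, None],
                            (semantic_vec.shape[0], h, w))
    return np.concatenate([feature_map, tiled], axis=0)


rng = np.random.default_rng(0)
semantic = rng.normal(size=(8,))                      # stand-in semantic vector
pyramid = [rng.normal(size=(16, s, s)) for s in (28, 14, 7)]   # three scales
fused = [fuse(level, semantic) for level in pyramid]
```

The same vector is attached at every location and every scale, so the cost of this fusion grows only with the number of added channels.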

[0126] The advantage of this stage is that it retains the image-text alignment ability learned during training, while avoiding complex token reconstruction and cross-attention calculations, significantly reducing latency and memory usage, and making it suitable for most online service scenarios.

[0127] Phase 2: Full reparameterization (ultimate inference optimization)

[0128] In practical deployments, the text prompt set is often fixed (e.g., 80 / 600 / 1k categories in open-vocabulary detection), which provides a key opportunity for further model optimization. We propose to fully reparameterize the text branch: the learnable parameters of the text encoder, linear projection layer, and contrastive head are equivalently fused into the corresponding layers of the visual backbone network.

[0129] The specific process is shown in Figure 5(b):

[0130] 1. Encode all predefined category texts in advance to generate a fixed set of semantic tokens;

[0131] 2. Treat the contrastive head as a decomposable nonlinear mapping whose weights can be merged into subsequent feature fusion layers through structural rewriting;

[0132] 3. When loading the model, use the contrastive head output as a "bias" or "adaptation vector" and inject it directly into the channel dimension of the visual features to form parameterized fusion features;

[0133] 4. Ultimately, the entire model requires only an image as input during inference to output detection and segmentation results, completely independent of text input. In essence, this process transforms "conditional prompts" into "built-in knowledge," reducing the multimodal model of this invention to a text-aware visual model. Its structure is fully consistent with a traditional CNN / Transformer backbone, allowing seamless integration into existing vision systems and supporting full optimization by high-performance inference engines such as TensorRT and ONNX Runtime.
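One way to see why a fixed prompt set allows the text branch to be folded away is the following sketch. It assumes, as an illustration only, that the layer right after the concatenation is linear over channels (a 1x1 convolution); in that case the contribution of the constant semantic vector is itself a constant and can be absorbed into the bias. All names and sizes are hypothetical, and the patent's actual rewriting of the contrastive head is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
C, C_S, C_OUT, H, W = 16, 8, 12, 5, 5         # assumed sizes
feat = rng.normal(size=(C, H, W))             # visual feature map
sem = rng.normal(size=(C_S,))                 # fixed, precomputed semantic vector
weight = rng.normal(size=(C_OUT, C + C_S))    # 1x1 conv over concatenated channels
bias = rng.normal(size=(C_OUT,))

# Before folding: broadcast, concatenate, then apply the 1x1 convolution.
tiled = np.broadcast_to(sem[:, None, None], (C_S, H, W))
concat = np.concatenate([feat, tiled], axis=0)
out_before = np.einsum("oc,chw->ohw", weight, concat) + bias[:, None, None]

# After folding: the semantic part of the weight becomes a constant bias,
# so the layer takes only the visual feature map as input.
w_visual = weight[:, :C]
folded_bias = bias + weight[:, C:] @ sem
out_after = np.einsum("oc,chw->ohw", w_visual, feat) + folded_bias[:, None, None]
```

Both paths give the same output, but the folded layer has fewer input channels and no dependence on the text encoder at run time.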

[0134] In summary, the inference design of this invention embodies the advanced concept of "dynamic training and static deployment": during the training phase, it fully utilizes language supervision to construct a robust semantic space; during the inference phase, it employs semantic approximation and reparameterization techniques to achieve a smooth transition from multimodal interaction to pure visual execution, balancing performance and efficiency. This provides a solution for open vocabulary perception tasks that is both expressive and practical. More importantly, the fused query tokens of this invention have the potential for further reinforcement of self-supervised learning, and can break free from the strong dependence on labeled data through the GRPO training paradigm.

[0135] Finally, it should be noted that the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A visual language model alignment method based on spatial semantic reconstruction, characterized in that it comprises:
S1. Obtaining multi-scale visual features of the input image through a visual encoder, and obtaining semantic embeddings of the text description through a text encoder;
S2. Introducing a learnable token as the initial query and performing cross-scale interaction with the multi-scale visual features, reconstructing the visual sequence step by step from the high-level semantic features to the low-level spatial features, generating the reconstructed visual sequence, and calculating the auxiliary spatial reconstruction loss;
S3. Mapping the semantic embeddings to the shared embedding space and performing bidirectional cross-mapping with the reconstructed visual sequence to achieve semantic alignment between the text and visual modalities, and calculating the auxiliary semantic reconstruction loss and the spatial semantic alignment loss;
S4. Fusing and optimizing the auxiliary spatial reconstruction loss, the auxiliary semantic reconstruction loss, the spatial semantic alignment loss, and the prediction loss of the downstream task to achieve high-dimensional alignment between visual spatial details and text semantics;
S5. During the model inference stage, based on the trained semantic alignment parameters, statically fusing the semantic features of the text description into the visual processing branch, so that the model degenerates into a single-modal visual structure when performing prediction tasks, thereby accelerating computation.

2. The visual language model alignment method based on spatial semantic reconstruction according to claim 1, characterized in that the visual encoder uses the DINOv3 ConvNeXt model, and the text encoder is a pre-trained CLIP or BERT model.

3. The visual language model alignment method based on spatial semantic reconstruction according to claim 1, characterized in that the multi-scale visual features include three feature maps with different resolutions, defined as a first resolution feature map, a second resolution feature map, and a third resolution feature map, respectively, wherein the spatial resolution of the feature maps decreases progressively from the first to the third.

4. The visual language model alignment method based on spatial semantic reconstruction according to claim 3, characterized in that the steps for generating the reconstructed visual sequence and calculating the auxiliary spatial reconstruction loss include:
The first, second, and third resolution feature maps are converted into tokens to generate first, second, and third visual token sequences, respectively;
The learnable token and the third visual token sequence are concatenated and input into the self-attention module, and only the self-attention output corresponding to the learnable token is retained as the first-stage query vector;
The first-stage query vector and the second visual token sequence are subjected to cosine cross-attention processing to output intermediate reconstructed features;
The intermediate reconstructed features are used as queries and the first visual token sequence is used as key-value pairs for causal cross-attention processing to obtain the reconstructed visual sequence;
The auxiliary spatial reconstruction loss is calculated based on the reconstructed visual sequence and the original feature sequence.
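The three-stage cascade of claim 4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the attention layers have no learned projections, the cosine attention uses a hypothetical temperature `tau`, the masking implied by "causal cross-attention" is not specified in the claim and is therefore omitted, the number of learnable tokens is assumed equal to the length of the first visual token sequence, and the loss is assumed to be a mean squared error.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention without learned projections."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def cosine_attention(q, k, v, tau=0.1):
    """Attention whose scores are cosine similarities divided by a temperature."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return softmax(qn @ kn.T / tau) @ v


rng = np.random.default_rng(0)
d = 32
seq1 = rng.standard_normal((64, d))    # first visual token sequence (highest resolution)
seq2 = rng.standard_normal((16, d))    # second visual token sequence
seq3 = rng.standard_normal((4, d))     # third visual token sequence (lowest resolution)
tokens = rng.standard_normal((64, d))  # learnable tokens (count assumed equal to len(seq1))

# Stage 1: self-attention over [tokens; seq3], keeping only the learnable-token rows.
joint = np.concatenate([tokens, seq3], axis=0)
stage1_query = attention(joint, joint, joint)[: len(tokens)]

# Stage 2: cosine cross-attention against the second sequence.
intermediate = cosine_attention(stage1_query, seq2, seq2)

# Stage 3: cross-attention against the first sequence (no mask in this sketch).
reconstructed = attention(intermediate, seq1, seq1)

# Assumed loss form: mean squared error against the first visual token sequence.
spatial_recon_loss = float(np.mean((reconstructed - seq1) ** 2))
```

Because each stage takes its keys and values from a progressively higher-resolution sequence, the queries move from high-level semantics toward low-level spatial detail, matching the order stated in the claim.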

5. The visual language model alignment method based on spatial semantic reconstruction according to claim 1, characterized in that the steps for achieving semantic alignment between the text and visual modalities include:
The semantic embedding of the text description is passed through two parallel linear projectors to generate a text query sequence and a text key-value sequence;
The text query sequence and the reconstructed visual sequence are subjected to cosine cross-attention processing to generate a visual token representation that is aligned with the text semantics;
The text key-value sequence and the reconstructed visual sequence are subjected to causal cross-attention processing to output the reconstructed text query sequence.
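A minimal NumPy sketch of the two parallel projectors and the two cross-attention passes in claim 5 follows. Several points are assumptions, because the claim leaves them open: the projector weights `W_query` and `W_keyval` are hypothetical random matrices, the text-side sequence is taken as the attention query in both passes so that each output has one row per text token, no causal mask is applied, and `tau` is a hypothetical temperature.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention without learned projections."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def cosine_attention(q, k, v, tau=0.1):
    """Attention whose scores are cosine similarities divided by a temperature."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return softmax(qn @ kn.T / tau) @ v


rng = np.random.default_rng(0)
d_text, d = 24, 32
text_emb = rng.standard_normal((5, d_text))  # semantic embeddings of 5 text tokens
recon_visual = rng.standard_normal((64, d))  # reconstructed visual sequence

# Two parallel linear projectors into the shared space (hypothetical weights).
W_query = rng.standard_normal((d_text, d)) / np.sqrt(d_text)
W_keyval = rng.standard_normal((d_text, d)) / np.sqrt(d_text)
text_query = text_emb @ W_query
text_keyval = text_emb @ W_keyval

# Pass 1: text queries attend to the visual sequence with cosine scores.
aligned_visual = cosine_attention(text_query, recon_visual, recon_visual)

# Pass 2 (assumed direction): the text key-value sequence attends to the
# visual sequence, giving a text-length reconstruction built from visual content.
recon_text_query = attention(text_keyval, recon_visual, recon_visual)
```

With this choice of direction, `recon_text_query` has the same shape as `text_query`, so a consistency measure between the two can be computed row by row.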

6. The visual language model alignment method based on spatial semantic reconstruction according to claim 5, characterized in that the steps for calculating the auxiliary semantic reconstruction loss and the spatial semantic alignment loss include:
The auxiliary semantic reconstruction loss is calculated based on the consistency difference between the reconstructed text query sequence and the original text query sequence;
The spatial semantic alignment loss is calculated based on the visual token representation and the reconstructed visual sequence.

7. The visual language model alignment method based on spatial semantic reconstruction according to claim 6, characterized in that the steps for fusing and optimizing the above losses with the prediction loss of the downstream task include:
The visual token representation is input into the contrastive learning head for nonlinear transformation to generate a projected semantic vector with enhanced discriminability;
The projected semantic vector and the multi-scale visual features are subjected to implicit contrastive learning at different levels to narrow the semantic distance between the semantic vector and the visual features at each scale;
The projected semantic vector is concatenated and fused with the original visual features at each scale to construct a multi-scale fused feature sequence;
The multi-scale fused feature sequence is input into the task head, and the model parameters are iteratively optimized by jointly using the output prediction loss, the auxiliary spatial reconstruction loss, the auxiliary semantic reconstruction loss, and the spatial semantic alignment loss.
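The projection, per-scale contrast, and concatenation steps of claim 7 can be sketched as below. The claim does not fix the form of any of them, so everything here is an illustrative assumption: the contrastive head is a hypothetical two-layer MLP with ReLU, the "implicit contrastive" term is approximated by a cosine distance between mean-pooled vectors, and concatenation is done along the token axis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
aligned_tokens = rng.standard_normal((6, d))                  # visual token representation
scales = [rng.standard_normal((n, d)) for n in (64, 16, 4)]   # token sequences per scale

# Hypothetical contrastive head weights.
W1 = rng.standard_normal((d, 2 * d)) * 0.1
W2 = rng.standard_normal((2 * d, d)) * 0.1


def contrast_head(x):
    """Nonlinear projection: linear -> ReLU -> linear."""
    return np.maximum(x @ W1, 0.0) @ W2


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


projected = contrast_head(aligned_tokens)  # projected semantic vectors, shape (6, d)
pooled = projected.mean(axis=0)

# Assumed per-scale distance: 1 - cosine similarity with the mean feature of that scale.
contrast_losses = [1.0 - cosine(pooled, f.mean(axis=0)) for f in scales]

# Assumed fusion: append the projected semantic tokens to each scale's sequence.
fused = [np.concatenate([f, projected], axis=0) for f in scales]
```

The list `fused` plays the role of the multi-scale fused feature sequence that would be passed to the task head; minimizing each entry of `contrast_losses` pulls the semantic vector toward the features of the corresponding scale.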

8. The visual language model alignment method based on spatial semantic reconstruction according to claim 7, characterized in that the prediction loss is fused with the auxiliary spatial reconstruction loss, the auxiliary semantic reconstruction loss, and the spatial semantic alignment loss through a weighted summation.
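The weighted summation of claim 8 amounts to a single expression. The short Python sketch below shows it; the weight values are hypothetical placeholders, since the claim does not state them.

```python
def total_loss(pred, spatial_recon, semantic_recon, alignment,
               weights=(1.0, 0.5, 0.5, 0.5)):
    """Weighted sum of the prediction loss and the three auxiliary losses.

    The default weights are illustrative only; the patent does not specify them.
    """
    terms = (pred, spatial_recon, semantic_recon, alignment)
    return sum(w * t for w, t in zip(weights, terms))


# Example with made-up loss values.
example = total_loss(pred=2.0, spatial_recon=0.4, semantic_recon=0.2, alignment=0.6)
```

Keeping the auxiliary weights below the prediction weight is one common way to stop the reconstruction terms from dominating the downstream objective, although the appropriate balance would have to be tuned.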