A three-dimensional visual language model pruning method based on perceptual semantic distance and spatial geometric distance

By employing a two-stage pruning method based on perceptual semantic distance and spatial geometric distance, the problems of token redundancy and low inference efficiency in 3D question answering tasks are solved. This method preserves key object features and global spatial coverage under a limited budget, thereby improving the inference efficiency and accuracy of 3D question answering.

CN122289623A · Pending · Publication Date: 2026-06-26 · SHANGHAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHANGHAI UNIV
Filing Date: 2026-05-26
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing visual language models suffer from issues such as redundant tokens, high inference latency, and large memory overhead in 3D question answering tasks. Furthermore, pruning based solely on semantic similarity or attention can easily lead to the loss of key object features or insufficient spatial coverage, thus affecting 3D inference performance.

Method used

By employing a two-stage pruning method based on perceptual semantic distance and spatial geometric distance, we first use attention to estimate semantic importance and retain significant tokens. Then, we select diverse tokens through an iterative farthest point sampling strategy and combine depth maps and camera parameters to establish a 3D geometric prior, ensuring key object features and global spatial coverage.

Benefits of technology

With a limited token budget, it improves inference efficiency, reduces latency and memory overhead, while maintaining the accuracy and robustness of 3D inference, making it suitable for 3D question answering tasks that rely on object recognition and cross-view consistency.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a 3D visual language model pruning method based on perceptual semantic distance and spatial geometric distance, relating to the fields of multimodal artificial intelligence and 3D scene understanding. The method includes: calculating the importance score of each token based on the visual encoder's self-attention matrix, sorting the tokens by score, and selecting the top n tokens as salient tokens; backprojecting the remaining tokens, along with the depth map and camera intrinsic and extrinsic parameters, onto a unified world coordinate system to construct a fusion metric that simultaneously considers semantic similarity and 3D geometric distance, and selecting m diverse tokens using a farthest point incremental sampling method; finally, concatenating the salient tokens and diverse tokens and inputting them into a language model to complete 3D question-answering reasoning. This invention can be applied to existing multi-view 3D question-answering systems without additional training, significantly reducing the number of visual tokens and reasoning latency while maintaining 3D reasoning capabilities.

Description

Technical Field

[0001] This invention relates to the fields of multimodal artificial intelligence and 3D scene understanding, and in particular to a 3D visual language model pruning method based on perceptual semantic distance and spatial geometric distance.

Background Technology

[0002] In 3D question answering tasks, multi-view RGB images are often used as input to compensate for the scarcity of 3D data. Existing visual language models typically extract a large number of patch-level visual tokens for each viewpoint, and then concatenate them with text tokens before inputting them into a large language model for cross-modal inference. Due to repeated observations of the same scene, similar background areas, planar structures, and textures will appear repeatedly in different viewpoints, resulting in a large number of redundant visual tokens. This can easily trigger context length limits and significantly increase inference latency and memory overhead.

[0003] Existing token compression methods are mostly geared towards 2D single-image or video scenarios, and typically prune based only on semantic similarity or attention. In 3D question answering, simple semantic pruning can easily lead to the loss of key object features or insufficient spatial coverage under aggressive compression, thus affecting the accuracy of spatial relationship reasoning, cross-view consistency, and answers to location-related questions.

[0004] Therefore, a pruning method is needed that preserves semantically key features and 3D spatial coverage under a limited budget.

Summary of the Invention

[0005] To address the technical problems existing in the prior art, this invention proposes a 3D visual language model pruning method based on perceptual semantic distance and spatial geometric distance. Without modifying the main parameters of the existing visual language model, it uses attention to estimate semantic importance and combines depth and camera parameters to establish a 3D geometric prior. Even during aggressive compression, it can still retain key object features and ensure global spatial coverage, thereby improving inference efficiency and maintaining 3D inference performance under a limited token budget.

[0006] To achieve the above objectives, this invention provides a 3D visual language model pruning method based on perceptual semantic distance and spatial geometric distance, comprising:
acquiring multi-view RGB images of the same 3D scene, the depth map and camera parameters corresponding to each view, and a text question;
inputting the multi-view RGB images into a visual encoder to extract the visual token features and self-attention matrix of each view;
calculating an importance score for each visual token based on the self-attention matrix, and sorting all visual tokens by importance score to construct a salient token set;
removing the tokens of the salient token set from the full token set, constructing three-dimensional coordinates for the remaining candidate tokens and performing cross-view alignment to obtain the three-dimensional spatial coordinates of each candidate token, and, based on a fusion metric, selecting from the remaining candidate tokens those complementary to the already selected tokens by an iterative farthest point sampling strategy, so as to obtain a diverse token set;
merging the salient token set and the diverse token set to form the final visual token sequence; and
inputting the final visual token sequence and the text question into a large language model for reasoning to obtain the answer output.

[0007] Preferably, the visual token features extracted from each viewpoint are as follows: $F_v = [f_{v,1}, f_{v,2}, \dots, f_{v,N}] \in \mathbb{R}^{N \times D}$; where $N$ is the number of tokens generated for each image, $D$ is the token feature dimension, $f_{v,i}$ is the $i$-th token feature of the $v$-th view, and $F_v$ is the token feature matrix of the $v$-th view.

[0008] Preferably, constructing the salient token set includes: under the constraints of the preset total token retention budget $M$ and the salient token ratio $r$, selecting the top $n$ ranked tokens to constitute the salient token set, wherein $n = r \cdot M$.

[0009] Preferably, the importance score of each visual token is calculated as follows: $s_i = \frac{1}{N} \sum_{j=1}^{N} A_{j,i}$; where $s_i$ is the importance score of the $i$-th token, $N$ is the number of tokens generated for each image, $A$ is the self-attention matrix, and $i$ denotes the $i$-th token.

[0010] Preferably, obtaining the three-dimensional spatial coordinates of each candidate token includes: for each candidate token, based on the corresponding image pixel region, depth map, and camera parameters, back-projecting the candidate token onto a unified world coordinate system to obtain its three-dimensional spatial coordinates, specifically: $p_i = \frac{1}{|\Omega_i|} \sum_{u \in \Omega_i} \pi^{-1}(u, z_u, K, T)$; where $p_i$ is the three-dimensional coordinates of the $i$-th token, $\Omega_i$ is the pixel region corresponding to the $i$-th token, $u$ iterates over each pixel in that region, $z_u$ is the pixel depth within the region, $K$ is the camera intrinsic parameters, $T$ is the camera extrinsic parameters, and $\pi^{-1}$ is the back-projection transformation function.

[0011] Preferably, the candidate tokens are back-projected to a unified world coordinate system, including: For each candidate token, the three-dimensional world coordinates of each pixel in the image block region are calculated using the depth value, camera intrinsic matrix, and camera extrinsic matrix through the back projection transformation function. The average of the three-dimensional coordinates of all pixels in the corresponding region is then used as the three-dimensional spatial coordinates of the corresponding candidate token.

[0012] Preferably, the fusion metric includes a three-dimensional geometric distance term and a semantic similarity term; wherein, the three-dimensional geometric distance term is used to measure the Euclidean distance between the candidate token and the selected token in the same three-dimensional space, and the semantic similarity term is used to measure the cosine similarity between the feature vectors of the candidate token and the selected token.

[0013] Preferably, the fusion metric is specifically a fusion distance $d(i, j)$ between a candidate token $i$ and a selected token $j$, which combines the three-dimensional geometric distance $d_{geo}(i, j)$ between the two tokens, a normalization term, and their semantic similarity $\mathrm{sim}(i, j)$, with a weight $\lambda$ used to balance spatial dispersion against semantic complementarity.

[0014] Preferably, the iterative farthest point sampling strategy includes: Initialization: Select the token with the highest importance score from the remaining candidate tokens and add it to the initially empty set of diverse tokens; Iterative selection: Calculate the minimum distance between each remaining candidate token and the current diverse token set under the fusion metric, and add the candidate token with the largest minimum distance to the diverse token set; Loop judgment: Repeat the iterative selection steps until the size of the diverse token set reaches the preset target value.

[0015] Preferably, the final visual token sequence is: $X = \{ f_i \mid i \in S_{sal} \cup S_{div} \}$; where $X$ is the final visual token sequence, $f_i$ is the feature vector of the $i$-th token, $i$ denotes the $i$-th token, $S_{sal}$ is the salient token set, and $S_{div}$ is the diverse token set. The answer output is: $\hat{y} = \mathrm{LLM}(X, Q)$; where $\hat{y}$ is the predicted answer generated by the model and $Q$ is the input question text.

[0016] Compared with the prior art, the present invention has the following advantages and technical effects: (1) This invention addresses the issues of high token redundancy, budget constraints, and high latency in multi-view 3D question answering reasoning. It introduces a pluggable token pruning module between the visual encoder and the large language model, employing a two-stage collaborative mechanism. First, salient tokens are retained based on attention importance, prioritizing object-level features such as key objects and key regions to reduce the risk of accidental deletion of main information under strong compression. Then, diverse tokens are supplemented from the remaining candidates to cover background structures, secondary objects, and distant regions, avoiding insufficient understanding of the global scene due to retaining only local salient regions. This design is particularly suitable for 3D QA tasks that rely on object recognition, cross-view consistency, and spatial relationship reasoning, achieving more stable reasoning performance under a fixed budget.

[0017] (2) This invention utilizes depth maps and camera poses to back-project candidate tokens onto a unified world coordinate system, enabling cross-view tokens to explicitly measure their 3D proximity relationships in 3D space. The diverse token selection stage further integrates 3D distance and semantic similarity for iterative sampling, which both suppresses cross-view redundancy and preserves semantically complementary and spatially dispersed token sets, thereby maintaining the robustness of 3D structural representation and reasoning under aggressive compression conditions.

[0018] (3) This invention uses the total retention budget as a hard constraint and achieves a controllable trade-off between accuracy and efficiency through the salient token ratio and the fusion weight. The salient token ratio determines the share of the budget given to key feature tokens, and the fusion weight steers the diversity completion toward either spatial dispersion or semantic complementarity. In engineering practice, the number and proportion of salient and diverse tokens can be counted directly, and the distribution of retained tokens in the images and in 3D space can be visualized to check whether key areas are covered and whether tokens are excessively concentrated or repeated, which gives the method good interpretability. After pruning, the number of visual tokens input to the large language model is significantly reduced, which lowers inference latency, memory, and compute overhead and improves throughput and deployment efficiency.

Brief Description of the Drawings

[0019] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:
Figure 1 is a schematic diagram of the semantic-geometric joint visual token pruning process according to an embodiment of the present invention;
Figure 2 is a schematic diagram of the diverse token selection process based on perceptual semantic distance and spatial geometric distance in an embodiment of the present invention.

Detailed Description of the Embodiments

[0020] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0021] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.

[0022] This embodiment proposes a 3D visual language model pruning method based on perceptual semantic distance and spatial geometric distance which, as shown in Figure 1, includes:
acquiring multi-view RGB images of the same 3D scene, the depth map and camera parameters corresponding to each view, and a text question;
inputting the multi-view RGB images into a visual encoder to extract the visual token features and self-attention matrix of each view;
calculating an importance score for each visual token based on the self-attention matrix, and sorting all visual tokens by importance score to construct a salient token set;
removing the tokens of the salient token set from the full token set, constructing three-dimensional coordinates for the remaining candidate tokens and performing cross-view alignment to obtain the three-dimensional spatial coordinates of each candidate token, and, based on a fusion metric, selecting from the remaining candidate tokens those complementary to the already selected tokens by an iterative farthest point sampling strategy, so as to obtain a diverse token set;
merging the salient token set and the diverse token set to form the final visual token sequence; and
inputting the final visual token sequence and the text question into a large language model for reasoning to obtain the answer output.

[0023] Specifically, multi-view RGB images of the same 3D scene are acquired as visual input, together with the depth map and the camera intrinsic and extrinsic parameters of each viewpoint, for subsequent cross-view geometric alignment and spatial measurement. The text question is received at the same time, and adjustment parameters such as the token retention budget are set. The multi-view images are input into the visual encoder to obtain a patch-level visual token feature sequence for each viewpoint, and a self-attention matrix is extracted from the visual encoder as the basis for token importance estimation. For a visual encoder that does not contain a [CLS] token, the attention received by each token is measured by aggregating the attention matrix column by column. This provides the basis for token importance assessment and subsequent selection, and allows the method to be inserted directly into existing pre-trained models without additional training.

[0024] Furthermore, the visual token features for each viewpoint are extracted as follows: $F_v = [f_{v,1}, f_{v,2}, \dots, f_{v,N}] \in \mathbb{R}^{N \times D}$; where $N$ is the number of tokens generated for each image, $D$ is the token feature dimension, $f_{v,i}$ is the $i$-th token feature of the $v$-th view, and $F_v$ is the token feature matrix of the $v$-th view.

[0025] Further, constructing the salient token set includes: under the constraints of the preset total token retention budget $M$ and the salient token ratio $r$, selecting the top $n$ ranked tokens to constitute the salient token set, wherein $n = r \cdot M$.

[0026] Furthermore, the importance score of each visual token is calculated as follows: $s_i = \frac{1}{N} \sum_{j=1}^{N} A_{j,i}$; where $s_i$ is the importance score of the $i$-th token, $N$ is the number of tokens generated for each image, $A$ is the self-attention matrix, and $i$ denotes the $i$-th token.

[0027] Specifically, importance scores for each visual token are obtained by attention aggregation, and all tokens are sorted in descending order of score. Under the constraints of the total retention budget $M$ and the salient token ratio $r$, the number of salient tokens $n$ is determined, and the $n$ top-ranked tokens are selected to form the salient token set. The salient token set prioritizes tokens related to key objects, key regions, and strong structures, so that information needed for object recognition, attribute determination, and key local details in question-answering tasks is not over-compressed or mistakenly deleted. This salient retention mechanism alleviates the problem of key features being diluted among a large number of redundant background tokens in multi-view scenarios.

[0028] Furthermore, the three-dimensional spatial coordinates of each candidate token are obtained, including: For each candidate token, the candidate token is back-projected onto a unified world coordinate system based on the corresponding image pixel region, depth map and camera parameters to obtain the three-dimensional spatial coordinates of each candidate token.

[0029] Specifically, after the set of significant tokens is determined, it is removed from the total number of tokens, and only the remaining candidates are subjected to 3D coordinate construction and cross-view alignment.

[0030] Let the remaining candidate index sequence be $\mathcal{C}$. For the $i$-th candidate token of each view $v$, its corresponding patch pixel region $\Omega_i$ is associated with the depth map, the pixel depth values $z_u$ within that region are read, and, combined with the camera intrinsic parameters $K$ and extrinsic parameters $T$, the pixels are back-projected into three-dimensional space and unified to the world coordinate system through coordinate transformation: $p_i = \frac{1}{|\Omega_i|} \sum_{u \in \Omega_i} \pi^{-1}(u, z_u, K, T)$; where $p_i$ is the three-dimensional coordinates of the $i$-th token, $\Omega_i$ is the pixel region corresponding to the $i$-th token, $u$ iterates over each pixel in that region, $z_u$ is the pixel depth within the region, $K$ is the camera intrinsic parameters, $T$ is the camera extrinsic parameters, and $\pi^{-1}$ is the back-projection transformation function.

[0031] This yields a unified cross-view representation set in which each candidate token is paired with its three-dimensional coordinates, making tokens from different views comparable in the same world coordinate system. When two tokens are close in the world coordinate system, they often correspond to repeated observations of the same object surface or the same area from different views; when they are far apart, they usually cover different spatial areas of the scene. This provides a directly computable three-dimensional distance metric for the subsequent selection of diverse tokens.

[0032] Furthermore, the candidate tokens are back-projected onto a unified world coordinate system, including: For each candidate token, the three-dimensional world coordinates of each pixel in the image block region are calculated using the depth value, camera intrinsic matrix, and camera extrinsic matrix through the back projection transformation function. The average of the three-dimensional coordinates of all pixels in the corresponding region is then used as the three-dimensional spatial coordinates of the corresponding candidate token.

[0033] Furthermore, the fusion metric includes a three-dimensional geometric distance term and a semantic similarity term; wherein, the three-dimensional geometric distance term is used to measure the Euclidean distance between the candidate token and the selected token in the same three-dimensional space, and the semantic similarity term is used to measure the cosine similarity between the feature vectors of the candidate token and the selected token.

[0034] Specifically, diverse token completion is performed within the candidate token set to maintain scene coverage and reduce redundancy under aggressive pruning. The candidate token with the highest attention score is used as the initialization seed of the diverse token set $S_{div}$, which is then expanded step by step through iterative incremental sampling. In each iteration, for any candidate token $i$ and selected token $j$, with world coordinates $p_i$, $p_j$ and feature vectors $f_i$, $f_j$, the geometric distance and the semantic similarity are calculated separately:
Geometric distance: $d_{geo}(i, j) = \lVert p_i - p_j \rVert_2$;
Semantic similarity: $\mathrm{sim}(i, j) = \frac{f_i^{\top} f_j}{\lVert f_i \rVert_2 \, \lVert f_j \rVert_2}$.
To simultaneously suppress both "spatial duplication" and "semantic content duplication," a fusion distance $d(i, j)$ between candidate token $i$ and selected token $j$ is constructed. It combines the three-dimensional geometric distance $d_{geo}(i, j)$, a normalization term, and the semantic similarity $\mathrm{sim}(i, j)$, with a weight $\lambda$ used to balance spatial dispersion against semantic complementarity.

[0035] For each candidate $i$, the fusion distance to its closest element in the set $S_{div}$ is used as the distance from the candidate to the set: $D(i, S_{div}) = \min_{j \in S_{div}} d(i, j)$. The candidate with the largest $D(i, S_{div})$ is then selected to join $S_{div}$. This is equivalent to selecting, at each step, the token that repeats the current set the least and best supplements a new spatial region or new semantic content. The selection and update steps are repeated until the number of tokens in the diverse token set $S_{div}$ reaches the preset target. This strategy is equivalent to performing farthest point sampling in the semantic-geometric fusion metric space. Under a limited budget it selects tokens that are more dispersed and more complementary in content, thereby improving the completeness and robustness of the overall scene representation.

[0036] Furthermore, after the salient token set and the diverse token set are selected, the two are merged to form the final retained set, from which the final visual token sequence is constructed: $X = \{ f_i \mid i \in S_{sal} \cup S_{div} \}$; where $X$ is the final visual token sequence, $f_i$ is the feature vector of the $i$-th token, $i$ denotes the $i$-th token, $S_{sal}$ is the salient token set, and $S_{div}$ is the diverse token set. Then $X$ and the text question $Q$ are input together into the large language model for cross-modal reasoning, and the answer output is: $\hat{y} = \mathrm{LLM}(X, Q)$; where $\hat{y}$ is the predicted answer generated by the model and $Q$ is the input question text.

[0037] Because the number of visual tokens is constrained by the budget $M$, the cost of attention computation and the context usage on the large language model side are significantly reduced, which lowers inference latency, memory overhead, and overall compute consumption. Furthermore, this embodiment keeps key object features from being lost through the salient attention retention mechanism, and ensures spatial coverage and information complementarity through the geometric alignment and diversity completion mechanisms. This allows stable 3D question-answering performance while the tokens are heavily compressed, achieving a balance between accuracy and efficiency.

[0038] To more clearly illustrate the technical solution of the present invention, a specific embodiment is described below:
Step 1: Multi-view observation and geometric prior acquisition:
Step 1.1: Acquire a set of multi-view RGB images of the same 3D scene, $\{I_v\}_{v=1}^{V}$, where $V$ is the number of viewpoints and $I_v$ is the image of the $v$-th viewpoint. Geometric prior information is acquired synchronously with the image of each viewpoint, including the depth map $Z_v$ and the camera parameters. The camera parameters include at least the camera intrinsic matrix $K_v$ and the camera extrinsic parameters $T_v = [R_v \mid t_v] \in SE(3)$, where $R_v$ is the rotation matrix, $t_v$ is the translation vector, and $SE(3)$ is the set of rigid-body transformations composed of rotation and translation.

[0039] Step 1.2: Obtain the text question $Q$, and set the token budget and ratio parameters on the inference side, including the total number of tokens to be retained $M$ and the salient token ratio $r$. The number of salient tokens is defined as: $n = r \cdot M$.
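The budget split in Step 1.2 can be illustrated with a small Python sketch. The function name `split_budget` is hypothetical, and rounding the salient count down to an integer is an assumption, since the text only relates the salient count to the budget and the ratio; the diverse count is taken as the remainder of the budget.

```python
def split_budget(total_budget: int, salient_ratio: float) -> tuple[int, int]:
    """Split a total token budget M into n salient and m diverse tokens."""
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    if not 0.0 <= salient_ratio <= 1.0:
        raise ValueError("salient_ratio must lie in [0, 1]")
    # Assumption: round down so the salient share never exceeds the budget.
    n_salient = int(total_budget * salient_ratio)
    # The remaining budget goes to the diverse token set.
    n_diverse = total_budget - n_salient
    return n_salient, n_diverse


if __name__ == "__main__":
    print(split_budget(64, 0.75))
```

Because the two counts always sum to the total budget, the budget acts as a hard constraint regardless of how the ratio is tuned.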

[0040] Step 2: Visual feature encoding:
Step 2.1: Input the image $I_v$ of each viewpoint into the visual encoder, which outputs patch-level visual token features: $F_v = [f_{v,1}, f_{v,2}, \dots, f_{v,N}] \in \mathbb{R}^{N \times D}$; where $N$ is the number of tokens generated for each image, $D$ is the token feature dimension, $f_{v,i}$ is the $i$-th token feature of the $v$-th view, and $F_v$ is the token feature matrix of the $v$-th view.

[0041] Step 3: Calculation of the saliency score:
Step 3.1: Attention score calculation: for a visual encoder that does not contain a [CLS] token, this embodiment uses the self-attention distribution within the visual encoder as the basis for token importance estimation.

[0042] Specifically, the attention matrix $A \in \mathbb{R}^{N \times N}$ is extracted from a certain layer (e.g., the last layer) of the visual encoder, where $A_{j,i}$ represents the attention weight of token $j$ on token $i$.

[0043] For a model that has no [CLS] token serving as a global query, this embodiment estimates the attention received by each token by aggregating the attention matrix column-wise, i.e., the attention values in the $i$-th column are averaged across all query tokens to obtain the importance score of that token: $s_i = \frac{1}{N} \sum_{j=1}^{N} A_{j,i}$; where $s_i$ is the importance score of the $i$-th token, $N$ is the number of tokens generated for each image, $A$ is the self-attention matrix, and $i$ denotes the $i$-th token.

[0044] The importance score $s_i$ represents the average attention intensity that token $i$ receives from the other tokens. It can be used as an estimate of the token's importance in the visual representation of the current view, and for the subsequent ranking and selection of salient tokens.
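The column-wise aggregation of Step 3.1 can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single attention matrix of shape (N, N) whose rows are query tokens and whose columns are the attended tokens, with any attention heads already averaged; the function name `importance_scores` is hypothetical.

```python
import numpy as np


def importance_scores(attn: np.ndarray) -> np.ndarray:
    """Average attention received by each token.

    attn[j, i] is assumed to be the weight that query token j places on
    token i, so averaging over axis 0 (the queries) gives one score per column.
    """
    if attn.ndim != 2 or attn.shape[0] != attn.shape[1]:
        raise ValueError("attn must be a square (N, N) matrix")
    return attn.mean(axis=0)


if __name__ == "__main__":
    attn = np.array([[0.5, 0.5],
                     [0.1, 0.9]])
    print(importance_scores(attn))
```

When each row of the attention matrix is a softmax distribution, the scores sum to 1, so they can be read as the share of total attention that each token receives.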

[0045] Step 3.2: Sorting and selecting salient tokens: all tokens are sorted in descending order of importance score $s_i$ to obtain the index sequence $\pi = (\pi_1, \pi_2, \dots)$, satisfying $s_{\pi_1} \ge s_{\pi_2} \ge \dots$. With the total retention budget set to $M$ and the salient token ratio set to $r$, the number of salient tokens is taken as $n = r \cdot M$. The first $n$ tokens of the sorted sequence are then selected to form the salient token set $S_{sal} = \{\pi_1, \dots, \pi_n\}$, which retains highly important visual information related to key objects or key areas.
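A minimal sketch of the sort-and-select step follows, assuming the importance scores are held in a one-dimensional NumPy array. The function name `select_salient` is hypothetical, and breaking ties by original index order is an assumption the text does not address.

```python
import numpy as np


def select_salient(scores: np.ndarray, n: int) -> tuple[np.ndarray, np.ndarray]:
    """Return (salient indices, remaining candidate indices), both ordered by descending score."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # A stable sort on the negated scores gives a descending order with
    # ties kept in original index order (an illustrative choice).
    order = np.argsort(-scores, kind="stable")
    return order[:n], order[n:]


if __name__ == "__main__":
    scores = np.array([0.1, 0.7, 0.2, 0.5])
    salient, candidates = select_salient(scores, 2)
    print(salient, candidates)
```

The second return value is the candidate pool passed on to the diverse token selection, already ordered so that its first element is the highest-scoring remaining token.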

[0046] Step 4: Diverse token selection based on perceptual semantic distance and spatial geometric distance, as shown in Figure 2:
Step 4.1: Patch-level 3D coordinate estimation: first, the salient token set $S_{sal}$ retained in Step 3 is removed from the full token set, and only the remaining candidate tokens undergo 3D coordinate alignment. For each viewpoint $v$, the $i$-th token has a corresponding pixel region $\Omega_i$. Using the pixel depth $z_u$ within the region, the camera intrinsic parameters $K$, and the extrinsic parameters $T$, the pixels are back-projected to 3D world coordinates and averaged to obtain the 3D coordinates of the corresponding token: $p_i = \frac{1}{|\Omega_i|} \sum_{u \in \Omega_i} \pi^{-1}(u, z_u, K, T)$; where $p_i$ is the three-dimensional coordinates of the $i$-th token, $\Omega_i$ is the pixel region corresponding to the $i$-th token, $u$ iterates over each pixel in that region, $z_u$ is the pixel depth within the region, $K$ is the camera intrinsic parameters, $T$ is the camera extrinsic parameters, and $\pi^{-1}$ is the back-projection transformation function.
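The patch-level back-projection of Step 4.1 could look roughly like the following NumPy sketch. Several conventions here are assumptions not fixed by the text: a pinhole camera with a 3x3 intrinsic matrix, depth measured along the optical axis, extrinsics given as a world-to-camera transform (x_cam = R x_world + t), square patches addressed by their top-left pixel, and valid depth at every pixel. The function name `backproject_patch` is hypothetical.

```python
import numpy as np


def backproject_patch(depth: np.ndarray, K: np.ndarray, R: np.ndarray,
                      t: np.ndarray, row0: int, col0: int, patch: int) -> np.ndarray:
    """Mean 3D world coordinate of the pixels in one square patch."""
    rows, cols = np.meshgrid(np.arange(row0, row0 + patch),
                             np.arange(col0, col0 + patch), indexing="ij")
    z = depth[rows, cols].astype(np.float64).ravel()          # (P,) pixel depths
    # Homogeneous pixel coordinates (u, v, 1) with u = column, v = row.
    pix = np.stack([cols.ravel(), rows.ravel(), np.ones(z.size)], axis=0)
    cam = (np.linalg.inv(K) @ pix) * z                        # (3, P) camera frame
    # Assumed world-to-camera extrinsics, inverted to reach world coordinates.
    world = R.T @ (cam - t.reshape(3, 1))
    return world.mean(axis=1)


if __name__ == "__main__":
    depth = np.full((4, 4), 2.0)
    print(backproject_patch(depth, np.eye(3), np.eye(3), np.zeros(3), 0, 0, 2))
```

If a dataset stores camera-to-world poses instead, the last transform becomes `R @ cam + t.reshape(3, 1)`; pixels with missing depth would also need to be masked out before averaging.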

[0047] Step 4.2: Construction of a unified world coordinate system representation: the 3D coordinates of the tokens of all viewpoints from Step 4.1 are unified to the same world coordinate system, and a set of pairs that are comparable across viewpoints is constructed: $\mathcal{P} = \{ (f_{v,i}, p_{v,i}) \mid v = 1, \dots, V \}$; where $\mathcal{P}$ is the global set of 3D-aligned tokens, $p_{v,i}$ is the three-dimensional position of the $i$-th token in the world coordinate system, $V$ is the number of input multi-view images, and $v$ denotes the $v$-th view.

[0048] Step 4.3: Candidate set construction and diverse set initialization: the token with the highest attention score among the remaining candidates is used as the initialization seed of the diverse set. Specifically, the remaining candidate index sequence $\mathcal{C}$ is sorted in descending order of attention score, the token with the highest score, $i_0 = \arg\max_{i \in \mathcal{C}} s_i$, is chosen, and the diverse token set is initialized as $S_{div} = \{ i_0 \}$.

[0049] Step 4.4: Calculation of the fusion distance metric: for any candidate token $i$ and selected token $j$, the geometric distance and the semantic similarity are calculated separately. The geometric distance is: $d_{geo}(i, j) = \lVert p_i - p_j \rVert_2$; where $d_{geo}(i, j)$ is the geometric distance, $p_i$ is the three-dimensional coordinates of token $i$ in the world coordinate system, and $p_j$ is the three-dimensional coordinates of token $j$ in the world coordinate system. The semantic similarity is: $\mathrm{sim}(i, j) = \frac{f_i^{\top} f_j}{\lVert f_i \rVert_2 \, \lVert f_j \rVert_2}$; where $\mathrm{sim}(i, j)$ is the semantic similarity between token $i$ and token $j$, $f_i$ is the feature vector of token $i$, and $f_j$ is the feature vector of token $j$.

[0050] To simultaneously suppress both "spatial duplication" and "semantic content duplication," a fusion distance $d(i, j)$ between candidate token $i$ and selected token $j$ is constructed. It combines the three-dimensional geometric distance $d_{geo}(i, j)$ between the two tokens, a normalization term, and their semantic similarity $\mathrm{sim}(i, j)$, with a weight $\lambda$ used to balance spatial dispersion against semantic complementarity.
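One possible way to combine the two terms is sketched below. The specific form used here, a convex combination of the scale-normalized Euclidean distance and one minus the cosine similarity, is an assumption made for illustration and is not taken from the patent text; the names `fusion_distance`, `lam`, and `scale` are hypothetical.

```python
import numpy as np


def fusion_distance(feat_i: np.ndarray, feat_j: np.ndarray,
                    pos_i: np.ndarray, pos_j: np.ndarray,
                    lam: float = 0.5, scale: float = 1.0) -> float:
    """Illustrative fused distance between two tokens (assumed form).

    lam weights spatial dispersion against semantic complementarity;
    scale normalizes the geometric distance (e.g. a scene diameter).
    """
    d_geo = np.linalg.norm(pos_i - pos_j)                       # Euclidean distance
    cos = feat_i @ feat_j / (np.linalg.norm(feat_i) * np.linalg.norm(feat_j) + 1e-12)
    # Larger when tokens are far apart in space or dissimilar in content.
    return float(lam * d_geo / scale + (1.0 - lam) * (1.0 - cos))


if __name__ == "__main__":
    f_a, f_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    p_a, p_b = np.zeros(3), np.array([3.0, 4.0, 0.0])
    print(fusion_distance(f_a, f_b, p_a, p_b, lam=0.5, scale=5.0))
```

Under this assumed form, setting `lam` close to 1 favors spatially dispersed tokens, while values close to 0 favor semantically different tokens.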

[0051] Step 4.5: Add the token that repeats the selected set the least: for each candidate token $i$, the fusion distance to every token in the current set $S_{div}$ is calculated, and the minimum value is taken as the distance from the candidate to the set, $D(i, S_{div}) = \min_{j \in S_{div}} d(i, j)$. This distance indicates how close the candidate is to the already selected content, with a smaller value meaning more repetition. The candidate with the largest distance is then selected from all candidates and added to the set $S_{div}$.

[0052] Step 4.6: Termination condition of the iterative sampling: Steps 4.4 to 4.5 are repeated until the size of the diverse set meets the budget constraint: $|S_{div}| = m = M - n$; where $|S_{div}|$ is the number of elements in the diverse token set.
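Steps 4.3 to 4.6 amount to farthest point sampling over a pairwise distance matrix, which can be sketched as follows. The fused distance computed in `fused_distance_matrix` uses an assumed form (geometric distance divided by its maximum, combined with one minus cosine similarity through a weight `lam`); both function names are hypothetical.

```python
import numpy as np


def fused_distance_matrix(feats: np.ndarray, coords: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Pairwise fused distances for C candidates (assumed form, see lead-in)."""
    d_geo = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    d_geo = d_geo / (d_geo.max() + 1e-12)            # assumed normalization to [0, 1]
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T                              # cosine similarity
    return lam * d_geo + (1.0 - lam) * (1.0 - sim)


def select_diverse(dist: np.ndarray, scores: np.ndarray, m: int) -> list[int]:
    """Iterative farthest point sampling seeded by the highest-scoring candidate."""
    m = min(m, dist.shape[0])
    if m <= 0:
        return []
    selected = [int(np.argmax(scores))]              # Step 4.3: seed
    min_dist = dist[selected[0]].copy()              # distance of each candidate to the set
    while len(selected) < m:                         # Step 4.6: stop at the target size
        min_dist[selected] = -np.inf                 # never pick a token twice
        nxt = int(np.argmax(min_dist))               # Step 4.5: largest minimum distance
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])   # update distance to the grown set
    return selected


if __name__ == "__main__":
    coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
    feats = np.tile(np.array([1.0, 0.0]), (3, 1))
    dist = fused_distance_matrix(feats, coords, lam=1.0)
    print(select_diverse(dist, np.array([0.9, 0.5, 0.1]), 2))
```

Keeping a running minimum distance per candidate avoids recomputing distances to the whole selected set at each step, so the loop costs O(C) per iteration once the distance matrix is available.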

[0053] Step 5: Cross-modal reasoning and answer generation:
Step 5.1: Token fusion: the salient token set is merged with the diverse token set to obtain the final token sequence: $X = \{ f_i \mid i \in S_{sal} \cup S_{div} \}$; where $X$ is the final visual token sequence, $f_i$ is the feature vector of the $i$-th token, $i$ denotes the $i$-th token, $S_{sal}$ is the salient token set, and $S_{div}$ is the diverse token set.
Step 5.2: Output the result: the final token sequence and the question text are input into the large language model for inference: $\hat{y} = \mathrm{LLM}(X, Q)$; where $\hat{y}$ is the predicted answer generated by the model and $Q$ is the input question text.
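The token fusion of Step 5.1 reduces to gathering the retained rows of the feature matrix, as in the sketch below. Placing salient tokens before diverse tokens is an assumption about ordering, the function name `build_visual_sequence` is hypothetical, and the call to the language model is omitted because its interface depends on the model in use.

```python
import numpy as np


def build_visual_sequence(features: np.ndarray, salient_idx, diverse_idx) -> np.ndarray:
    """Gather the retained token features: salient tokens first, then diverse tokens."""
    keep = np.concatenate([np.asarray(salient_idx, dtype=int),
                           np.asarray(diverse_idx, dtype=int)])
    # The two sets are disjoint by construction; guard against accidental overlap.
    if np.unique(keep).size != keep.size:
        raise ValueError("salient and diverse index sets must not overlap")
    return features[keep]


if __name__ == "__main__":
    features = np.arange(12, dtype=float).reshape(6, 2)   # 6 tokens, dimension 2
    print(build_visual_sequence(features, [4, 1], [0]))
```

The resulting array has one row per retained token, so its length equals the total retention budget when both sets are filled to their targets.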

[0054] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A three-dimensional visual language model pruning method based on perceptual semantic distance and spatial geometric distance, characterized in that it comprises: acquiring multi-view RGB images of the same 3D scene together with the depth map and camera parameters corresponding to each view, and obtaining a text question; inputting the multi-view RGB images into a visual encoder to extract the visual token features and self-attention matrix of each view; calculating an importance score for each visual token based on the self-attention matrix, and sorting all visual tokens by importance score to construct a salient token set; removing the tokens of the salient token set from the full token set, and performing three-dimensional coordinate construction and cross-view alignment on the remaining candidate tokens to obtain the three-dimensional spatial coordinates of each candidate token; based on a fusion metric, selecting from the remaining candidate tokens the tokens that are complementary to the already selected tokens, using an iterative farthest point sampling strategy, to obtain a diverse token set; merging the salient token set and the diverse token set to form the final visual token sequence; and inputting the final visual token sequence and the text question into a large language model for reasoning to obtain the answer output.

2. The method according to claim 1, characterized in that the visual token features extracted from each view are: $X_v = [x_{v,1}, x_{v,2}, \dots, x_{v,N}] \in \mathbb{R}^{N \times d}$; in the formula, $N$ is the number of tokens generated per image, $d$ is the token feature dimension, $x_{v,1}$, $x_{v,2}$ and $x_{v,N}$ are the first, second and $N$-th token features of the $v$-th view, and $X_v$ is the token feature matrix of the $v$-th view.

3. The method according to claim 1, characterized in that constructing the salient token set comprises: under the constraints of a preset total token retention budget M and a salient token ratio, selecting the top n tokens in the ranking to constitute the salient token set, wherein ...

4. The method according to claim 3, characterized in that the importance score of each visual token is calculated by a formula whose terms are: the importance score of the i-th token, the number of tokens generated for each image, the self-attention matrix, and the token index i.

5. The method according to claim 1, characterized in that obtaining the three-dimensional spatial coordinates of each candidate token comprises: for each candidate token, back-projecting the candidate token into a unified world coordinate system based on its corresponding image pixel region, the depth map, and the camera parameters, to obtain the three-dimensional spatial coordinates of the candidate token, specifically: $p_i = \frac{1}{|\Omega_i|} \sum_{(u,v) \in \Omega_i} \pi^{-1}\big(u, v, D(u,v), K, T\big)$; in the formula, $p_i$ is the three-dimensional coordinates of the $i$-th token, $\Omega_i$ is the pixel region corresponding to the $i$-th token, $(u,v)$ ranges over each pixel in that region, $D(u,v)$ is the pixel depth within the region, $K$ is the camera intrinsic parameters, $T$ is the camera extrinsic parameters, and $\pi^{-1}$ is the back-projection transformation.

6. The method according to claim 5, characterized in that back-projecting the candidate tokens into the unified world coordinate system comprises: for each candidate token, calculating the three-dimensional world coordinates of each pixel in the image block region from the depth value, the camera intrinsic matrix, and the camera extrinsic matrix through the back-projection transformation function, and taking the average of the three-dimensional coordinates of all pixels in the region as the three-dimensional spatial coordinates of that candidate token.

7. The method according to claim 1, characterized in that the fusion metric includes a three-dimensional geometric distance term and a semantic similarity term; wherein the three-dimensional geometric distance term measures the Euclidean distance between the candidate token and the selected token in the same three-dimensional space, and the semantic similarity term measures the cosine similarity between the feature vectors of the candidate token and the selected token.

8. The method according to claim 7, characterized in that the fusion metric is given by a formula whose terms are: the fusion distance between a candidate token and a selected token, the weighting coefficient used to balance spatial dispersion and semantic complementarity, the three-dimensional geometric distance between the candidate token and the selected token, the normalization factor, and the semantic similarity between the candidate token and the selected token.

9. The method according to claim 1, characterized in that the iterative farthest point sampling strategy comprises: initialization: selecting the token with the highest importance score from the remaining candidate tokens and adding it to the initially empty diverse token set; iterative selection: calculating, under the fusion metric, the minimum distance between each remaining candidate token and the current diverse token set, and adding the candidate token with the largest minimum distance to the diverse token set; loop judgment: repeating the iterative selection step until the size of the diverse token set reaches the preset target value.

10. The method according to claim 1, characterized in that the final visual token sequence is: $X_{\text{final}} = \{\, x_i \mid t_i \in S_{\text{sal}} \cup S_{\text{div}} \,\}$; in the formula, $X_{\text{final}}$ is the final visual token sequence, $x_i$ is the feature vector of the $i$-th token, $t_i$ is the $i$-th token, $S_{\text{sal}}$ is the salient token set, and $S_{\text{div}}$ is the diverse token set; and the answer output is: $\hat{A} = \mathrm{LLM}(X_{\text{final}}, Q)$; in the formula, $\hat{A}$ is the predicted answer generated by the model and $Q$ is the input question text.