Tree-index-based optimization method for multimodal retrieval-augmented generation in the aviation field
By constructing a tree index for multimodal retrieval-augmented generation in the aviation field, this method addresses the low recall and poor semantic understanding that large language models show on long multimodal aviation documents. It enables efficient and reliable multimodal information interaction and improves the accuracy and controllability of knowledge retrieval and content generation in the aviation field.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: CHINA AERO POLYTECH ESTAB
- Filing Date: 2025-11-18
- Publication Date: 2026-06-19
Figure CN121636758B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to retrieval-augmented generation technology for large models in the aviation field, and specifically to a tree-index-based optimization method for multimodal retrieval-augmented generation in the aviation field.
Background Technology
[0002] In recent years, Large Language Models (LLMs) based on the Transformer architecture have made groundbreaking progress in the field of Natural Language Processing (NLP). These models, through self-supervised pre-training on large-scale text data, have demonstrated excellent language understanding and generation capabilities, and are widely used in tasks such as dialogue systems, automatic summarization, machine translation, and code generation. However, despite their good language modeling capabilities in general contexts, large language models still face a series of challenges in specific domains, tasks with high accuracy requirements, or knowledge-intensive scenarios, particularly in terms of limitations regarding knowledge boundaries, factual consistency, and controllability.
[0003] A major problem is that LLMs' knowledge sources are limited to the information contained in their training corpora. Therefore, when faced with queries outside the training corpus or long-tail domain problems, the models are prone to a phenomenon known as "hallucination"—generating linguistically fluent but factually incorrect or fabricated content. This behavior can have serious consequences in scenarios with extremely high requirements for factual accuracy, such as medicine, law, finance, and scientific research, limiting the practical usability of large language models in high-credibility applications.
[0004] To address these issues, academia and industry have proposed the Retrieval-Augmented Generation (RAG) framework. RAG decouples external knowledge retrieval mechanisms from the language model, introducing an external, updatable, structured, or unstructured knowledge base as a source of supplementary facts. Before generating a response, the system first semantically encodes the user's query and retrieves relevant document fragments or entries from the knowledge base through similarity calculations. These fragments, along with the original input, are then fed into the language model to generate the response. By explicitly introducing relevant knowledge, RAG effectively improves the model's performance in terms of factual consistency, interpretability, and cross-domain generalization, while significantly reducing the error generation rate.
[0005] The advantages of RAG lie not only in improving generation accuracy, but also in the flexibility and scalability brought by its modular architecture. Compared to directly "fixing" knowledge in a large language model, dynamic updates and multi-source integration of knowledge can be achieved through external retrieval mechanisms. For example, combining RAG architecture with search engines can enable more real-time and broader-coverage question-and-answer services; in enterprise knowledge management systems, RAG can be used to build intelligent assistants that support dynamic business changes. In addition, RAG is also widely used in intelligent customer service, scientific research writing assistance, educational Q&A, and many other fields.
[0006] While current RAG methods have achieved significant results in practical applications, several technical challenges remain, such as the granularity of retrieved documents, semantic matching of knowledge and queries, the ability to fuse multimodal information, and the efficiency and controllability of utilizing retrieved information during the generation process. Especially in multimodal scenarios, user queries may involve non-textual information such as images and tables, which traditional text retrieval and generation frameworks struggle to handle directly. Therefore, constructing an efficient and controllable retrieval enhancement generation system oriented towards multimodal scenarios has become an important direction in current research and application.
[0007] In summary, to address the hallucination problem of large language models in specialized knowledge scenarios and to further enhance their multimodal interaction capabilities, an augmented generation method that integrates multimodal retrieval mechanisms is urgently needed to achieve high-quality natural language responses in complex information environments. Such a technology will help artificial intelligence systems evolve towards specialization and credibility, and will provide a solid technical foundation for applications such as intelligent question-answering systems, search engines, and multimodal human-computer interaction.
Summary of the Invention
[0008] To address the shortcomings of existing technologies, this invention aims to provide a tree-index-based optimization method for multimodal retrieval-augmented generation in the aviation field. The method improves the index-building and score-calculation stages of traditional retrieval-augmented generation to alleviate the low recall and poor understanding of multi-level semantic information that RAG systems show on long, multimodal aviation documents, thus meeting the increasingly complex needs of the aviation industry. It builds a tree index over the original knowledge base and fuses adjacent information in the knowledge base, alleviating the loss of coherent semantic information caused by segmentation granularity in traditional RAG systems. By introducing a prior on scores within long documents, a likelihood method adjusts the similarity score and suppresses the influence of noise and irrelevant content.
[0009] Specifically, this invention provides a tree-index-based optimization method for multimodal retrieval-augmented generation in the aviation field, comprising the following steps:
[0010] S1. Construct the source dataset for a multimodal knowledge base in the aviation field;
[0011] S2. Tree index construction and retrieval based on the embedding model, specifically including:
[0012] S21. Construct a tree index based on the embedding model: Use the aviation multimodal pages obtained in S1 as the bottom-level nodes for representation, and complete node aggregation and tree index construction from bottom to top;
[0013] S22. Convert the input natural language query into a query embedding and perform hierarchical semantic retrieval on the tree index structure to obtain the retrieval candidate set, which is the union of the candidate node set obtained using a greedy strategy and the candidate node set obtained using a fixed-width strategy. The retrieval candidate set is the set of nodes that is input to the subsequent score optimization module;
[0014] S3. Perform multi-dimensional score optimization on the retrieval candidate set, specifically including:
[0015] S31. Score optimization based on likelihood value, the formula is as follows:
[0016] ;
[0017] where the formula's terms are the original retrieval score, a likelihood score based on the spatial distribution pattern of similarity between pages in aviation documents, and the fusion coefficient, and its result is the score after likelihood optimization;
[0018] S32. Path context-based score optimization, the formula is as follows:
[0019] ;
[0020] where the formula's result is the context-optimized score, and its terms are: the embedding vector of the input aviation business query; the embedding of a document page at the bottom-level nodes; the embedding of that page's parent node at each layer of the tree; the path decay coefficient, which adjusts the weight of the context score at each layer and so controls how strongly higher-level information influences the score; and the index of the tree layer;
[0021] S4. The scores of the multiple dimensions are fused and ranked to obtain a set of relevant contexts that guides the generation of question-answering results, thus completing the optimization of multimodal retrieval-augmented generation in the aviation field.
[0022] Preferably, the tree index constructed in step S21 is as follows:
[0023] ;
[0024] ;
[0025] where the formulas define the set of image-block nodes at each layer and the number of nodes in that layer; the minimum operator takes the smaller of its two arguments; the aggregation step-size parameter sets the group size; and each node has a set of child nodes in the layer below, whose size is its number of child nodes. After the termination condition is met, the tree index graph is obtained, whose node set is the union of the nodes at each level of the tree index.
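As an illustration only, one reading of the construction that is consistent with the definitions above can be written as follows. The symbols $V^{(l)}$ (node set of layer $l$), $N_l$ (its size), $k$ (aggregation step size), $\mathrm{Ch}(\cdot)$ (child set), and $G = (V, E)$ (tree index graph) are notation assumed here; they are not recovered from the original formulas.

```latex
N_l = \left\lceil \frac{N_{l-1}}{k} \right\rceil,
\qquad
\mathrm{Ch}\!\left(v_i^{(l)}\right)
  = \left\{\, v_j^{(l-1)} \;:\; (i-1)k + 1 \le j \le \min\!\left(ik,\; N_{l-1}\right) \right\},
\quad i = 1, \dots, N_l

G = (V, E), \qquad V = \bigcup_{l=0}^{L} V^{(l)}
```

Under this reading, a knowledge base of $N_0 = 10$ pages with $k = 4$ gives $N_1 = 3$ groups covering pages 1 to 4, 5 to 8, and 9 to 10, and then $N_2 = 1$ top-level node.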
[0026] Preferably, step S1 specifically includes: collecting relevant standards, airworthiness provisions, maintenance manuals, design manuals, and test reports in the aviation field, and parsing each document into a page-based image format to obtain the source dataset of the aviation multimodal knowledge base;
[0027] where each element of the source dataset is one page of the knowledge base, and the size of the dataset is the total number of pages across all documents of the knowledge base.
[0028] Preferably, step S21 specifically includes the following sub-steps:
[0029] S211. Bottom-level node representation: each page image in the knowledge base is input into the multimodal vector embedding model to extract its vector embedding. Each page image together with its vector embedding forms one bottom-layer node of the tree index, namely:
[0030] ;
[0031] S212. Given the node set of the current top layer and its number of nodes, the images of adjacent child nodes in the current layer are stitched together in groups whose size is set by the aggregation step-size parameter, generating the aggregated image blocks of the next layer up, each constructed as follows:
[0032] ;
[0033] where the current layer is initially the bottom layer of page nodes; the stitching operator denotes image stitching; the result is the aggregated image block obtained by stitching at the new layer; and its argument is the set of images from the layer below used to construct that aggregated image block;
[0034] S213. Each aggregated image block of the new layer is input into the multimodal vector embedding model to extract its semantic vector representation:
[0035] ;
[0036] The resulting node representation is as follows:
[0037] ;
[0038] The node set of this layer is:
[0039] , ;
[0040] S214. For each generated node of the new layer, record its corresponding child node list. This establishes the parent-child connections that form the set of directed edges of the tree index;
[0041] S215. Repeat steps S212-S214 until any termination condition is met, at which point the tree index graph is obtained.
[0042] Preferably, in step S212, for a fixed parent index, images are selected sequentially from the group's starting index up to its upper-bound index, so that images are aggregated in groups of at most the aggregation step size; when the number of nodes in the current layer is not divisible by the aggregation step size, the upper-bound index of the last group is the index of the last node in the layer, that is, only the remaining child images are stitched together;
[0043] The termination condition in step S215 is: the number of nodes in the current top layer is less than a threshold, or the number of tree levels reaches a preset upper limit.
[0044] Preferably, step S22 specifically includes the following sub-steps:
[0045] S221. Natural language query embedding: the natural language query entered by aviation business personnel is preprocessed and input into the multimodal vector embedding model to obtain the corresponding embedding vector. In the tree index, a root node is defined whose child nodes are the top-layer nodes, and each node of each layer has its own embedded representation;
[0046] S222. A hybrid hierarchical retrieval scheme combining greedy search and fixed-width search is adopted to obtain the candidate set, specifically:
[0047] S2221. Using a greedy strategy, at each level the node most semantically similar to the query embedding vector is selected from the current position, recursively forming the most relevant query path layer by layer. Starting from the root node, the cosine similarity score between the query vector and every node of the current layer is calculated layer by layer:
[0048] , ;
[0049] where the node index runs up to the total number of nodes in the layer;
[0050] Nodes are added to a priority queue according to their similarity scores, and the node with the highest score is selected as the best matching node for that layer:
[0051] ;
[0052] The child nodes of this node are then taken as the candidate set for the next layer, and step S2221 is repeated to eventually form a greedy path:
[0053] ;
[0054] where the path length is bounded by the number of levels in the tree, and the path consists of the best matching node of each layer from the first layer down to the layer at which the search ends. If the highest similarity score in the current layer is lower than a preset threshold, the search stops early at that layer, and the best matching node of that layer is denoted as the terminal node;
[0055] S2222. Introduce a fixed-width strategy to expand horizontally at each layer. For each layer, the nodes in that layer most similar to the query embedding vector, up to the fixed width, are selected to form a fixed-width candidate set:
[0056] ;
[0057] where the selection operator takes the candidate nodes with the highest similarity, up to the fixed width; all child nodes of these candidate nodes are then gathered into the node set of the next layer. Step S2222 is repeated until the query reaches the bottom leaf nodes, finally obtaining the union of the terminal nodes:
[0058] ;
[0059] where the union is taken over the fixed-width candidate sets of all layers, from the first layer to the last, to gather the total set of candidate nodes;
[0060] S223. The union of the greedy-strategy node set and the fixed-width-strategy node set is taken as the retrieval candidate set.
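For orientation, the retrieval quantities described in S221 to S223 can be written compactly as below. This is a hedged sketch: the symbols $q$ (query embedding), $e_i^{(l)}$ (embedding of node $v_i^{(l)}$ in layer $l$), $w$ (fixed width), $C_{\mathrm{greedy}}$, $C_{\mathrm{fixed}}$, and $C$ are notation assumed here rather than recovered from the original formulas. Only the cosine similarity definition itself is standard.

```latex
s_i^{(l)} = \cos\!\left(q,\, e_i^{(l)}\right)
          = \frac{q \cdot e_i^{(l)}}{\lVert q \rVert \,\lVert e_i^{(l)} \rVert},
\qquad
i_*^{(l)} = \arg\max_{i}\; s_i^{(l)}

C_{\mathrm{fixed}}^{(l)} = \left\{\, v_i^{(l)} \;:\; s_i^{(l)} \text{ is among the } w \text{ largest scores in layer } l \right\},
\qquad
C = C_{\mathrm{greedy}} \cup C_{\mathrm{fixed}}
```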
[0061] Preferably, step S4 specifically includes the following sub-steps:
[0062] S41. Constructing the fusion score: Based on the multi-dimensional scores of each candidate node obtained in step S3, the final score is calculated using the fusion formula:
[0063] ;
[0064] where the formula's terms are the weight adjustment parameters and its result is the final score after fusion;
[0065] S42. Diversity reordering strategy: the candidate node set, together with the fused scores obtained in step S41, is reordered to obtain the set of relevant pages used to guide subsequent question-answer generation, where the number of nodes considered is the total number of candidate nodes; this step specifically includes:
[0066] S421. Initialize the final candidate set. In each round, the following heuristic objective function is used to select, from the candidate node set, the node with the highest score and the lowest similarity to the already selected nodes, and that node is added to the final candidate set;
[0067] ;
[0068] where the formula's terms are: the fusion score of a candidate node; the weight of the redundancy penalty term; the vector cosine similarity between a candidate node and an already selected node; and the maximum of that similarity over the current final candidate set, that is, the similarity to the most similar selected node;
[0069] S422. Repeat step S421 until the number of nodes in the final candidate set reaches the preset upper limit on the number of candidates, yielding the set of relevant pages used to guide subsequent question-answer generation;
[0070] S43. Generate question-answer results: the set of relevant pages, combined with the user's natural language query, is input into a multimodal large language model for generative reasoning, producing the final aviation question-answering results.
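One plausible form of the selection objective in S421, written out as a sketch, is given below. The symbols $C$ (candidate set), $R$ (set already selected), $S_f$ (fused score), $\beta$ (redundancy penalty weight), and $e_c$ (embedding of node $c$) are assumed notation, and treating the penalty as zero while $R$ is empty is also an assumption. The structure resembles maximal marginal relevance, which likewise trades relevance against similarity to items already chosen.

```latex
c^{*} = \arg\max_{c \,\in\, C \setminus R}
        \left[\; S_f(c) \;-\; \beta \, \max_{r \in R} \cos\!\left(e_c,\, e_r\right) \right],
\qquad
R \leftarrow R \cup \{ c^{*} \}
```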
[0071] Preferably, step S31 specifically includes the following sub-steps:
[0072] S311. Construct the log-likelihood expression as follows:
[0073] ;
[0074] where the sum runs over the number of candidate nodes; the scores of candidate pages at a given page distance from the target answer page are assumed to follow a normal distribution with a given variance, whose mean is a function giving the expected similarity score as it varies with page distance;
[0075] S312. By reconstructing the likelihood score and fusing it with the weights, we obtain a score optimization formula based on the likelihood value:
[0076] The optimization objective is:
[0077] ;
[0078] A hyperparameter is introduced to adjust the relative weights of the linear and squared terms, constructing the likelihood score:
[0079] ;
[0080] Finally, a fusion coefficient is introduced, and the original score and the likelihood score are linearly combined to obtain the final adjusted score.
[0081] Preferably, step S32 specifically includes the following sub-steps:
[0082] S321. Path context embedding representation: for a specific page of a document, let the embedding sequence from the leaf node representing this page up to the top-level node be:
[0083] ;
[0084] where the first element is the embedded representation of the page itself, and each subsequent element is the embedding of the page's parent node at the corresponding layer of the tree;
[0085] S322, Path Context Score: Introducing a path weighted fusion mechanism, attenuation weights are added to embeddings at different levels to obtain the context score formula.
[0086] Preferably, in step S2221, the terminal node is mapped to its leaf nodes:
[0087] .
[0088] The above formula yields the set of candidate pages for the greedy strategy.
[0089] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0090] (1) This invention uses pages as the basic unit of the RAG retrieval system and constructs a hierarchical index that integrates adjacent content. Page images serve as a unified carrier for multimodal elements such as text, tables, curves, schematic diagrams, formulas, and layout structures. This preserves structure and semantics that span pages, avoids the semantic fragmentation and text-image alignment errors caused by pure text segmentation, and achieves efficient organization and retrieval of aviation data, thereby providing users with a richer information interaction experience.
[0091] (2) This invention implements a hierarchical retrieval strategy on a tree index that takes into account both relevance and coverage, thereby achieving efficient location and semantic expansion of long documents. Under the premise of controllable computing resources, it can quickly lock the suspected answer area, ensure full coverage of various types of evidence such as standard clauses, maintenance procedures, and diagram descriptions, solve the problems of insufficient recall and high off-topic rate of traditional RAG in complex multimodal data, and improve the stability and applicability of the system in aviation question answering and knowledge retrieval.
[0092] (3) This invention optimizes the scores of candidate pages by combining the positional patterns of documents with the path context to suppress redundant and noisy pages, thereby achieving more accurate and reliable result ranking. It can effectively reduce the risk of hallucination in RAG generation and improve the consistency and controllability of answers, and thus meets the high-precision and high-reliability requirements of the aviation field for knowledge retrieval and content generation.
Attached Figure Description
[0093] Figure 1 is a schematic diagram of the overall process of the present invention;
[0094] Figure 2 is a flowchart of the method of the present invention;
[0095] Figure 3 is a schematic diagram of the tree index graph construction process of the present invention.
Detailed Implementation
[0096] Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.
[0097] Specifically, this invention provides a tree-index-based optimization method for multimodal retrieval-augmented generation in the aviation field. As shown in Figures 1 to 3, it includes the following steps:
[0098] S1. Construct the source dataset for a multimodal knowledge base in the aviation field. Specific steps include: collecting relevant aviation standards, airworthiness regulations, maintenance manuals, design manuals, test reports, and other documents; and parsing each document into a page-based image format to obtain the source dataset of the aviation multimodal knowledge base;
[0099] where each element of the source dataset is one page of the knowledge base, and the size of the dataset is the total number of pages across all documents of the knowledge base.
[0100] S2. Building and retrieving a tree-based index based on the embedding model, specifically including the following sub-steps:
[0101] S21. Construct a tree index based on the embedding model: Use the aviation multimodal page obtained in step S1 as the bottom-level node, and complete node aggregation and tree index construction from bottom to top. The constructed tree index is as follows:
[0102] ;
[0103] ;
[0104] where the formulas define the set of image-block nodes at each layer and the number of nodes in that layer; the minimum operator takes the smaller of its two arguments; the aggregation step-size parameter sets the group size; and each node has a set of child nodes in the layer below, whose size is its number of child nodes. After the termination condition is met, the tree index graph is obtained, whose node set is the union of the nodes at each level of the tree index.
[0105] Step S21 specifically includes the following sub-steps:
[0106] S211. Bottom-level node representation: each page image in the knowledge base is input into the multimodal vector embedding model to extract its vector embedding representation. Each page image together with its vector embedding forms one bottom-layer node of the tree index, namely:
[0107] .
[0108] S212. Given the node set of the current top layer (initially the bottom layer of page nodes) and its number of nodes, the images of adjacent child nodes in the current layer are stitched together in groups whose size is set by the aggregation step-size parameter, generating the aggregated image blocks of the next layer up, each constructed as follows:
[0109] ;
[0110] where the stitching operator denotes image stitching; the result is the aggregated image block obtained by stitching at the new layer; and its argument is the set of images from the layer below used to construct that block. For a fixed parent index, images are selected sequentially from the group's starting index up to its upper-bound index, so that images are aggregated in groups of at most the aggregation step size; when the number of nodes in the current layer is not divisible by the aggregation step size, the upper-bound index of the last group is the index of the last node in the layer, so that only the remaining child images are stitched together.
[0111] S213. Each aggregated image block of the new layer is input into the embedding model to extract its semantic vector representation:
[0112] .
[0113] The resulting node representation is as follows:
[0114] .
[0115] The node set of this layer is:
[0116] , .
[0117] S214. For each generated node of the new layer, record its corresponding child node list. This establishes the parent-child connections that form the set of directed edges of the tree index.
[0118] S215. Repeat steps S212-S214 until any stopping condition is met. The stopping condition is: the number of nodes in the current top layer is less than a threshold, or the number of tree levels reaches a preset upper limit.
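The bottom-up construction in S211 to S215 can be sketched as follows. This is a minimal illustration, not the patented implementation: `Node`, `build_tree_index`, and the parameter names `k`, `min_top_nodes`, and `max_layers` are hypothetical, and the embedding model and the image-stitching operation are passed in as caller-supplied callables because the source does not name a specific model or stitching routine.

```python
from dataclasses import dataclass, field
from math import ceil
from typing import Any, Callable, List, Sequence

Vector = List[float]


@dataclass
class Node:
    layer: int                      # 0 = page level, higher = more aggregated
    index: int                      # position of the node within its layer
    image: Any                      # page image or stitched image block
    embedding: Vector               # output of the embedding callable
    children: List["Node"] = field(default_factory=list)


def build_tree_index(
    pages: Sequence[Any],
    embed: Callable[[Any], Vector],           # assumed: multimodal embedding model
    stitch: Callable[[List[Any]], Any],       # assumed: image stitching operation
    k: int = 4,                               # aggregation step size
    min_top_nodes: int = 2,                   # stop when the top layer is smaller
    max_layers: int = 8,                      # stop when this many layers are added
) -> List[List[Node]]:
    """Return the layers of the tree, layers[0] being the page-level nodes."""
    if k < 2:
        raise ValueError("k must be at least 2 so that each layer shrinks")
    # S211: every page image and its embedding form a bottom-layer node.
    layers = [[Node(0, i, img, embed(img)) for i, img in enumerate(pages)]]
    # S215: repeat until either stopping condition holds.
    while len(layers[-1]) >= min_top_nodes and len(layers) - 1 < max_layers:
        current = layers[-1]
        parents = []
        for i in range(ceil(len(current) / k)):
            # S212: group adjacent nodes; the last group may hold fewer than k.
            group = current[i * k: min((i + 1) * k, len(current))]
            block = stitch([n.image for n in group])
            # S213 and S214: embed the stitched block and record its children.
            parents.append(Node(len(layers), i, block, embed(block), list(group)))
        layers.append(parents)
    return layers
```

With 10 pages and `k=4`, this sketch produces layers of 10, 3, and 1 nodes, and the last node of the middle layer has only two children.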
[0119] S22. Convert the natural language query input by aviation business personnel (such as "minimum bending radius requirement for laying this type of aluminum cable") into a query embedding, and then perform hierarchical semantic retrieval on the tree index structure to obtain the retrieval candidate set, which is the union of the candidate node set obtained using a greedy strategy and the candidate node set obtained using a fixed-width strategy. The retrieval candidate set is the set of nodes input to the subsequent score optimization module. Step S22 specifically includes the following sub-steps:
[0120] S221. Natural language query embedding: the natural language query entered by aviation business personnel (for example, "Minimum bending radius requirement for laying this type of aluminum cable") is preprocessed and input into the same embedding model used in the tree index construction stage to obtain its embedding vector. In the tree index, a root node is defined whose child nodes are the top-layer nodes, and each node of each layer has its own embedded representation.
[0121] S222. A hybrid hierarchical retrieval scheme combining greedy search and fixed-width search is adopted to obtain the candidate set:
[0122] S2221. The greedy strategy selects, at each level, the node most semantically similar to the query embedding vector from the current position, recursively forming the most relevant query path layer by layer downwards. Starting from the root node, the cosine similarity score between the query vector and every node of the current layer is calculated layer by layer:
[0123] , ;
[0124] where the node index runs up to the total number of nodes in the layer;
[0125] Nodes are added to a priority queue according to their similarity scores, and the node with the highest score is selected as the best matching node for that layer:
[0126] .
[0127] The child nodes of this node are then taken as the candidate set for the next layer, and step S2221 is repeated to eventually form a greedy path:
[0128] ;
[0129] where the path length is bounded by the number of levels in the tree, and the path consists of the best matching node of each layer from the first layer down to the layer at which the search ends. If the highest similarity score in the current layer is lower than a preset threshold, the search stops early at that layer, and the best matching node of that layer is denoted as the terminal node.
[0130] To ensure that all candidates are page-level leaf nodes, the terminal node is mapped to its leaf nodes:
[0131] .
[0132] S2222. Introduce a fixed-width strategy to expand horizontally at each layer. For each layer, the nodes in that layer most similar to the query embedding vector, up to the fixed width, are selected to form a fixed-width candidate set:
[0133] ;
[0134] where the selection operator takes the candidate nodes with the highest similarity, up to the fixed width. All child nodes of these candidate nodes are then aggregated into the node set of the next layer. Step S2222 is repeated until the query reaches the bottom leaf nodes, finally obtaining the union of the terminal nodes:
[0135] .
[0136] where the union is taken over the fixed-width candidate sets of all layers, from the first layer to the last, to gather the total set of candidate nodes.
[0137] S223. The union of the greedy-strategy node set and the fixed-width-strategy node set is taken as the retrieval candidate set.
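The hybrid retrieval of S2221 to S223 can be illustrated with the sketch below. All names (`Node`, `greedy_candidates`, `fixed_width_candidates`, `hybrid_candidates`, `width`, `threshold`) are hypothetical. Two readings are assumed and flagged in the comments: on early stop, the best node of the stopping layer is taken as the terminal node, and the fixed-width branch returns only the page-level nodes kept at the bottom layer, which simplifies the layer-wise union described in the text.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List


@dataclass
class Node:
    name: str
    embedding: List[float]
    children: List["Node"] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def leaves(node: Node) -> List[Node]:
    """Map a node to the page-level leaves beneath it."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def greedy_candidates(root: Node, query: List[float], threshold: float = 0.0) -> List[Node]:
    node = root
    while node.children:
        best = max(node.children, key=lambda c: cosine(query, c.embedding))
        node = best  # assumption: the best node of the stopping layer is the terminal node
        if cosine(query, best.embedding) < threshold:
            break    # early stop when even the best match is below the threshold
    return leaves(node)


def fixed_width_candidates(root: Node, query: List[float], width: int = 2) -> List[Node]:
    frontier = list(root.children)
    while frontier:
        kept = sorted(frontier, key=lambda n: cosine(query, n.embedding), reverse=True)[:width]
        expanded = [child for n in kept for child in n.children]
        if not expanded:   # bottom layer reached (all leaves share one depth in this tree)
            return kept
        frontier = expanded
    return []


def hybrid_candidates(root: Node, query: List[float], width: int = 2,
                      threshold: float = 0.0) -> List[Node]:
    merged = {}
    for n in greedy_candidates(root, query, threshold) + fixed_width_candidates(root, query, width):
        merged.setdefault(id(n), n)   # union without duplicates, order preserved
    return list(merged.values())
```

Keeping the two branches as separate functions makes it easy to inspect how many pages each strategy contributes before the union is taken.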
[0138] S3. Perform multi-dimensional score optimization on the retrieval candidate set, which includes the following sub-steps:
[0139] S31. Score optimization based on likelihood value, the formula is as follows:
[0140] ;
[0141] where the formula's terms are the original retrieval score, a likelihood score based on the spatial distribution pattern of similarity between pages in aviation documents, and the fusion coefficient, and its result is the score optimized based on the likelihood value. Step S31 specifically includes the following sub-steps:
[0142] S311. Construct the likelihood function:
[0143] The set of candidate node pages obtained in step S2 using the hierarchical semantic retrieval method is as follows:
[0144] .
[0145] where the set size is the total number of candidate nodes. The corresponding cosine similarity scores are:
[0146] .
[0147] Statistical analysis of actual aviation document retrieval samples shows that the similarity score decreases approximately exponentially as the page distance between a candidate page and the actual answer page increases. Therefore, the scores of candidate pages at a given page distance from the target answer page are assumed to follow a normal distribution with a given variance, whose mean is a function giving the expected similarity score as it varies with page distance; this function characterizes the decay trend that the greater the page spacing, the lower the similarity. Using the probability density function of the normal distribution, the joint likelihood that each candidate position is the target answer page is calculated from the page similarity scores:
[0148] .
[0149] That is, substituting the probability density function of the normal distribution into the equation gives:
[0150] .
[0151] Taking the log-likelihood, we obtain the log-likelihood expression:
[0152] .
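The derivation in S311 can be worked through as follows under stated assumptions. The symbols are assumed notation: $s_j$ is the similarity score of candidate $j$, $d_{ij}$ the page distance between candidates $i$ and $j$, $f(d)$ the expected score at distance $d$, $\sigma^2$ the variance, and $n$ the number of candidates. The steps themselves follow directly from the normal density.

```latex
\mathcal{L}_i
  = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{\left(s_j - f(d_{ij})\right)^{2}}{2\sigma^{2}} \right)

\log \mathcal{L}_i
  = -\,n \log\!\left(\sqrt{2\pi}\,\sigma\right)
    \;-\; \frac{1}{2\sigma^{2}} \sum_{j=1}^{n} \left(s_j - f(d_{ij})\right)^{2}
```

Because the first term does not depend on $i$, choosing the position $i$ that maximizes $\log \mathcal{L}_i$ is equivalent to minimizing the sum of squared deviations $\sum_{j} \left(s_j - f(d_{ij})\right)^{2}$.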
[0153] S312, Likelihood score reconstruction and weight fusion, where the optimization objective is:
[0154] .
[0155] A hyperparameter is introduced to adjust the relative weights of the linear and squared terms, constructing the likelihood score:
[0156] ;
[0157] A fusion coefficient is then introduced, and the original score and the likelihood score are linearly combined to obtain the final adjusted score.
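A rough sketch of likelihood-based score adjustment is shown below. It is not the formula of S312, which is not recoverable from the text: the exponential decay `exp(-decay * d)` for the expected score, the min-max normalization, the convex blend `(1 - fusion) * s + fusion * v`, and all function and parameter names are assumptions made for illustration.

```python
from math import exp
from typing import List


def log_likelihood_scores(scores: List[float], pages: List[int],
                          decay: float = 0.5, sigma: float = 0.2) -> List[float]:
    """For each candidate i, the log-likelihood (up to a constant shared by all i)
    that page i is the answer page, given every candidate's similarity score."""
    result = []
    for i in range(len(scores)):
        sse = 0.0
        for s_j, p_j in zip(scores, pages):
            # Assumed decay model: expected similarity falls off exponentially
            # with the page distance to the hypothesised answer page.
            expected = exp(-decay * abs(p_j - pages[i]))
            sse += (s_j - expected) ** 2
        result.append(-sse / (2.0 * sigma ** 2))
    return result


def adjust_scores(scores: List[float], pages: List[int], fusion: float = 0.3,
                  decay: float = 0.5, sigma: float = 0.2) -> List[float]:
    """Blend the original scores with normalised likelihood scores."""
    ll = log_likelihood_scores(scores, pages, decay, sigma)
    lo, hi = min(ll), max(ll)
    # Assumed min-max normalisation so both terms share the range [0, 1].
    norm = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in ll]
    return [(1.0 - fusion) * s + fusion * v for s, v in zip(scores, norm)]
```

With `fusion=0.0` the original scores are returned unchanged, which gives a simple baseline for judging how much the likelihood term reorders the candidates.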
[0158] S32. Path context-based score optimization, defining the context score as:
[0159] ;
[0160] where the formula's result is the context-optimized score, and its terms are: the embedding vector of the input aviation business query; the embedding of a document page at the bottom-level nodes; the embedding of that page's parent node at each layer of the tree; the path decay coefficient, which adjusts the weight of the context score at each layer and so controls how strongly higher-level information influences the score; and the index of the tree layer.
[0161] Preferably, step S32 specifically includes the following sub-steps:
[0162] S321. Path context embedding representation: for a specific page of a document, let the embedding sequence from the leaf node representing this page up to the top-level node be:
[0163] ;
[0164] where the first element is the embedded representation of the page itself, and each subsequent element is the embedding of the page's parent node at the corresponding layer of the tree.
[0165] S322, Path Context Score: Introducing a path-weighted fusion mechanism, attenuation weights are added to embeddings at different levels, and the context score is defined as:
[0166] ;
[0167] where the formula's terms are the user query vector and the path decay coefficient, which adjusts the weight of the context score at each layer and so controls how strongly higher-level information influences the score.
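A minimal sketch of a path context score is given below, assuming the score is a decay-weighted average of the cosine similarities between the query and each embedding on the page's leaf-to-top path. The geometric weights `decay ** level`, the normalization by the sum of weights, and the names `path_context_score` and `decay` are assumptions, not the formula of S322.

```python
from math import sqrt
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def path_context_score(query: Sequence[float],
                       path_embeddings: List[Sequence[float]],
                       decay: float = 0.5) -> float:
    """path_embeddings[0] is the page itself; path_embeddings[level] is the
    embedding of its ancestor at that tree layer."""
    if not path_embeddings:
        raise ValueError("the path must contain at least the page embedding")
    # Higher layers get geometrically smaller weights (assumed decay scheme).
    weights = [decay ** level for level in range(len(path_embeddings))]
    total = sum(w * cosine(query, e) for w, e in zip(weights, path_embeddings))
    return total / sum(weights)   # assumed normalisation, keeps the score in [-1, 1]
```

Setting `decay=0.0` reduces the score to the page's own cosine similarity, while values closer to 1 let ancestor blocks contribute almost as much as the page.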
[0168] S4. The scores from multiple dimensions are fused and ranked to obtain a set of relevant contexts to guide the generation of question-and-answer results. Step S4 specifically includes the following sub-steps:
[0169] S41. Constructing the fusion score: Based on the multi-dimensional scores of each candidate node obtained in step S3, the final score is calculated using the fusion formula:
[0170] ;
[0171] where the formula's terms are the weight adjustment parameters, and its result serves as the basis for the node's participation in the final ranking and reordering; the multi-dimensional scores include the original cosine similarity score, the optimization score based on the likelihood distribution, and the path context enhancement score.
[0172] S42. Diversity reordering strategy:
[0173] If nodes are simply selected in descending order of score during ranking, highly similar pages are easily selected repeatedly, resulting in information redundancy and reducing the diversity and effective length of the context used for generation. Therefore, a diversity reordering strategy is applied to the aforementioned candidate node set and the fused scores obtained for it in step S41.
[0174] S421. First, initialize the final candidate set. In each round, the following heuristic objective function is used to select, from the candidate node set, the node with the highest score and the lowest similarity to the already selected nodes, and that node is added to the final candidate set.
[0175] ;
[0176] where the terms are, in order: the fusion score of the candidate node; the weight of the redundancy penalty term; the cosine similarity between the vectors of the candidate node and an already selected node; and the maximum of that similarity, that is, the candidate node's similarity to the most similar node in the current selected set.
[0177] S422. Repeat step S421 until the number of elements in the final candidate set reaches the preset maximum number of candidates, which yields the final set of relevant pages used to guide subsequent question-and-answer generation.
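Steps S421 and S422 describe a greedy selection loop. The Python sketch below assumes the objective takes the form "fusion score minus penalty weight times the maximum cosine similarity to any already selected node", which matches the term definitions above but is not confirmed by a legible formula. The names `diversity_rerank`, `penalty`, and `max_selected`, and the alphabetical tie-breaking, are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diversity_rerank(
    scores: Dict[str, float],
    vectors: Dict[str, Sequence[float]],
    penalty: float,
    max_selected: int,
) -> List[str]:
    """Greedily pick nodes that score highly and differ from those already picked."""
    selected: List[str] = []  # S421: initialize the final candidate set
    remaining = set(scores)
    while remaining and len(selected) < max_selected:  # S422: stop at the preset limit

        def objective(node: str) -> float:
            # Redundancy is the similarity to the closest already selected node;
            # it is 0.0 in the first round, when nothing has been selected yet.
            redundancy = max(
                (cosine(vectors[node], vectors[s]) for s in selected), default=0.0
            )
            return scores[node] - penalty * redundancy

        # sorted() makes tie-breaking deterministic (alphabetical by node id).
        best = max(sorted(remaining), key=objective)
        selected.append(best)
        remaining.remove(best)
    return selected


# p1 and p2 are near-duplicate pages; p3 is a different page with a lower score.
page_scores = {"p1": 0.90, "p2": 0.88, "p3": 0.60}
page_vectors = {"p1": [1.0, 0.0], "p2": [1.0, 0.0], "p3": [0.0, 1.0]}
reranked = diversity_rerank(page_scores, page_vectors, penalty=0.5, max_selected=2)
plain_top2 = diversity_rerank(page_scores, page_vectors, penalty=0.0, max_selected=2)
```

In this example the penalized ranking selects p1 and then p3, skipping the duplicate p2, while a penalty of zero falls back to plain score ordering and selects p1 and p2.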
[0178] S43. Generate question-and-answer results: the set of relevant pages, combined with the user's natural language query, is input into a multimodal large language model for generative reasoning, producing the final aviation question-answering result. The generated result can also display information about the relevant pages, such as the standard name and page number, so that business personnel can trace the answer to its source.
[0179] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.
Claims
1. A tree-index-based multimodal retrieval-augmented generation optimization method for the aviation field, characterized in that it includes the following steps: S1. Construct the source dataset for a multimodal knowledge base in the aviation field; S2. Construct a tree index and perform retrieval based on the embedding model, specifically including: S21. Construct a tree index based on the embedding model: use the aviation multimodal pages obtained in S1 as the bottom-level nodes, and complete node aggregation and tree index construction from the bottom up; S22. Convert the input natural language query into a query embedding, perform hierarchical semantic retrieval on the tree index structure, and obtain a retrieval candidate set formed from a candidate node set obtained with a greedy strategy and a candidate node set obtained with a fixed-width strategy, the retrieval candidate set being the node set input to the subsequent score optimization module; S3. Perform multi-dimensional score optimization on the retrieval candidate set, specifically including: S31. Score optimization based on the likelihood value, with the formula as follows: ; where the terms are the original retrieval score, a likelihood score based on the spatial distribution pattern of similarity between pages in aviation documents, the fusion coefficient, and the score after likelihood optimization; S32. Score optimization based on the path context, with the formula as follows: ; where the terms are the context-optimized score; the embedding vector of the input aviation business query; the embedding of a document page at the bottom-level nodes; the embedding of that page's parent node at a given layer of the tree; the path decay coefficient, which adjusts the weight of the context score at each layer and controls the degree of influence of higher-level information on the score; and the layer index of the tree index; S4. Fuse and rank the scores of the multiple dimensions to obtain a set of relevant contexts that guides the generation of question-and-answer results, thereby completing the optimization of multimodal retrieval-augmented generation in the aviation field.
2. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 1, characterized in that: the tree index constructed in step S21 is as follows: ; ; where the definitions cover the set of image-block nodes at a given layer; the operation that takes the smaller of two values; the aggregation step size parameter; the number of nodes; the set of child nodes of a node; and the number of child nodes that a node has at the corresponding layer. After the termination condition is met, the tree index graph is obtained, which consists of the union of the nodes at every level of the tree index and the set of directed edges in the tree index.
3. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 1, characterized in that: step S1 specifically includes: collecting relevant standards, airworthiness regulations, maintenance manuals, design manuals, and test reports in the aviation field, and parsing each document into a format organized as page-by-page images, to obtain the source dataset of the multimodal knowledge base in the aviation field: ; where each element is a page of the knowledge base, identified by its page index, and the upper bound of the index is the total number of pages in all documents of the knowledge base.
4. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 1, characterized in that: step S21 specifically includes the following sub-steps: S211. Bottom-level node representation: each page image in the knowledge base is input into the multimodal vector embedding model to extract its vector embedding representation, and each page image together with its vector embedding forms one bottom-layer node of the tree index, namely: ; S212. For the node set of the current top layer, with its number of nodes, adjacent child-node images of the current layer are stitched together in groups according to the set aggregation step size parameter to generate the aggregated image blocks of the next layer, each constructed as follows: ; initially, ; where the stitching operator denotes the image stitching operation, the result is the set of image patches of the next layer obtained by stitching, and the operand is the set of current-layer images used to construct one aggregated image patch of the next layer; S213. Each aggregated image patch of the new layer is input into the multimodal vector embedding model to extract its semantic vector representation: ; the resulting node is represented as: ; and the node set of the new layer is: , ; S214. For each generated node of the new layer, its corresponding child node list is recorded, which establishes the parent-child connections and forms the set of directed edges in the tree index: ; S215. Repeat steps S212 to S214 until any termination condition is met, at which point the tree index graph is obtained.
5. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 4, characterized in that: in step S212, for a fixed group index, images are selected sequentially from the group's starting index up to its upper bound index, so that at most a step-size number of images are aggregated into each group; when the number of nodes is not divisible by the aggregation step size, the upper bound index of the last group is the last available index, that is, only the remaining child images are stitched together; the termination condition in step S215 is that the current number of top-level nodes is less than a threshold, or that the number of tree levels reaches a preset upper limit threshold.
6. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 1, characterized in that: step S22 specifically includes the following sub-steps: S221. Natural language query embedding: the natural language query entered by aviation business personnel is preprocessed and input into the multimodal vector embedding model to obtain the corresponding query embedding vector. In the tree index, the root node and the child nodes of the root node are denoted, and the embedded representation of each node at each layer is as follows: ; S222. A hybrid hierarchical retrieval scheme combining greedy search and fixed-width search is adopted to obtain the candidate set, specifically: S2221. Using a greedy strategy, at each level the node semantically most similar to the query embedding vector is selected from the current position, recursively forming the most relevant query path layer by layer. Starting from the root node, the cosine similarity score between every node of the current layer and the query vector is calculated layer by layer: , ; where the bound is the total number of nodes in the layer. The nodes are added to a priority queue according to their similarity scores, and the node with the highest score is selected as the best matching node for that layer: ; the child nodes of this node then serve as the candidate set for the next layer, and step S2221 is repeated to form the greedy path: ; where the path length is the number of levels in the tree, and the path elements are the best matching nodes from the first layer to the last layer. If the highest similarity score in the current layer is lower than a preset threshold, the search stops early at that layer, and the best matching node of that layer is denoted as the end node; S2222. A fixed-width strategy is introduced to perform horizontal expansion at each layer. At each layer, the nodes most similar to the query vector are selected from all nodes in the layer, up to the fixed width, to form a fixed-width candidate set: ; where the selection operator takes the candidate nodes with the highest similarity, and all child nodes of these candidate nodes are then gathered into the node set of the next layer. Step S2222 is repeated until the query reaches the bottom leaf nodes, finally giving the union of the terminal nodes: ; where the union operation is taken over the fixed-width candidate sets of all layers, from the first layer to the last, to gather the total set of candidate nodes; S223. The union of the greedy-strategy node set and the fixed-width-strategy node set is taken as the retrieval candidate set.
7. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 1, characterized in that: step S4 specifically includes the following sub-steps: S41. Constructing the fusion score: based on the multi-dimensional scores of each candidate node obtained in step S3, the final score is calculated using the fusion formula: ; where the parameter is a weight adjustment parameter and the result is the final score after fusion; S42. Diversity re-ranking strategy: the obtained candidate node set and the fused scores obtained for it through step S41 are re-ranked to obtain the set of relevant pages used to guide subsequent question-and-answer generation, where the count denotes the total number of candidate nodes; this step specifically includes: S421. Initialize the final candidate set. In each round, the following heuristic objective function is used to select from the candidate node set the node with the highest score and the lowest similarity to the already selected nodes, and that node is added to the final candidate set: ; where the terms are the fusion score of the candidate node; the weight of the redundancy penalty term; the cosine similarity between the vectors of the candidate node and an already selected node; and the maximum similarity between the candidate node and the most similar node in the current selected set; S422. Repeat step S421 until the number of nodes in the final candidate set reaches the preset upper limit on the number of candidates, which gives the set of relevant pages used to guide subsequent question-and-answer generation; S43. Generate question-and-answer results: the set of relevant pages, combined with the user's natural language query, is input into a multimodal large language model for generative reasoning, producing the final aviation question-answering results.
8. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 1, characterized in that: step S31 specifically includes the following sub-steps: S311. Construct the log-likelihood expression as follows: ; where the count is the number of candidate nodes; for a candidate page at a given page distance from the target answer page, the score is assumed to follow a normal distribution with a given variance, and the mean function gives the expected value of the similarity score as it varies with that distance; S312. By reconstructing the likelihood score and fusing it with weights, a score optimization formula based on the likelihood value is obtained. The optimization objective is: ; a hyperparameter is introduced to adjust the relative weights of the linear and squared terms to construct the likelihood score: ; finally, the fusion coefficient is introduced, and the original score and the likelihood score are linearly combined to obtain the final adjusted score.
9. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 1, characterized in that: step S32 specifically includes the following sub-steps: S321. Path context embedding representation: for a specific page of a document, let the embedding sequence from the leaf node representing this page up to the top-level node be: ; where the first element is the embedding of the page itself, and each subsequent element is the embedding of the page's parent node at the corresponding layer of the tree; S322. Path context score: a path-weighted fusion mechanism applies decay weights to the embeddings at different levels to obtain the context score formula.
10. The method for enhancing and optimizing multimodal retrieval in the aviation field based on tree indexing according to claim 6, characterized in that: in step S2221, the end node is mapped to its leaf nodes: ; the result of the above formula is the set of candidate pages for the greedy strategy.