Artificial intelligence question and answer accuracy optimization method, device, equipment and storage medium
By performing semantic unit segmentation and vectorization encoding on multi-source heterogeneous data, a hierarchical semantic index tree is constructed. Combined with the RoBERTa model for intent recognition and SimCSE-RoBERTa encoding, the problem of insufficient semantic alignment in intelligent question answering systems is solved, and efficient and accurate question answering results are generated.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SCIMEE TECH & SCI CO LTD
- Filing Date
- 2026-06-01
- Publication Date
- 2026-06-30
AI Technical Summary
Existing intelligent question answering systems lack a unified semantic alignment mechanism in multi-source heterogeneous data scenarios. As a result, retrieval can return relevant material while the generated answer is still inaccurate or inconsistent with the original data, and high-quality answers are difficult to select accurately.
Multi-source heterogeneous data is segmented into semantic units and vectorized with Sentence-BERT, TaBERT, and CLIP encoders, and a hierarchical semantic index tree structure is constructed from the resulting vectors. A RoBERTa model performs intent recognition and entity extraction to generate structured query triples. SimCSE-RoBERTa is then used for contrastive learning encoding, and multi-dimensional semantic similarity and credibility scores are calculated. Finally, the optimal prompt words are reconstructed to optimize the question-answering results.
It achieves unified semantic representation and efficient retrieval of multi-source heterogeneous data, ensuring a high degree of consistency between the answer content and both the user's intent and the original data, effectively overcoming the AI hallucination problem, and improving the accuracy and reliability of the question-answering system.
Figure CN122309686A_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a method, apparatus, device and storage medium for optimizing the accuracy of AI question answering.
Background Technology
[0002] In recent years, Large Language Models (LLMs) have demonstrated significant capabilities in natural language processing tasks such as text generation and question answering. Large models such as OpenAI's GPT series, Anthropic's Claude series, and DeepSeek series have achieved breakthroughs in various benchmark tests. However, LLMs primarily rely on parameterized knowledge for reasoning and generation. This knowledge is stored in fixed model parameters during training and cannot be updated in real time. This leads to the models being prone to factual errors or completely fictitious content when handling open-ended, time-sensitive, or domain-specific tasks—a phenomenon known as AI hallucination.
[0003] The AI hallucination problem is particularly prominent in enterprise data query scenarios. When an LLM is applied directly to enterprise self-service data queries, the model may generate answers that do not match the actual business data because it lacks real-time data perception capabilities.
[0004] To alleviate the hallucination problem of large language models, Retrieval-Augmented Generation (RAG) technology has emerged. It assists LLMs in generating answers by retrieving real-time information from external knowledge bases and has been widely applied in question-answering systems and intelligent search. The core mechanism of RAG technology lies in breaking down query processing into three stages: "retrieval-augmentation-generation." First, semantic retrieval of documents in the knowledge base is performed using a vector database. Then, the retrieved relevant context is concatenated with the user query and input into the large language model. Finally, the model generates an answer that incorporates external knowledge. Introducing RAG can significantly improve the recall rate of relevant documents; a manufacturing company's experience shows that query accuracy increased from 58% to 82%. RAG technology has thus become a key bridge connecting LLMs and enterprise private data, demonstrating significant advantages in multi-source knowledge-enhanced question-answering tasks.
[0005] Existing RAG-based intelligent question answering systems still have significant limitations in multi-source data scenarios. First, multi-source heterogeneous data (such as text, tables, and images) are usually modeled independently, lacking a unified semantic alignment mechanism. This makes it difficult to effectively integrate cross-modal information, resulting in semantic discrepancies between retrieval results and questions, affecting the overall accuracy of question answering. Second, existing methods mainly rely on single similarity matching during retrieval and generation, lacking deep modeling of the consistency between "question-candidate answer-data source," which easily leads to problems such as relevant retrieval but inaccurate answers or inconsistencies with the original data.
[0006] Furthermore, existing systems generally lack effective result optimization and filtering mechanisms. For example, they do not fully consider the redundancy, diversity, and differences in data source reliability among candidate answers, making it difficult to accurately select high-quality answers. At the same time, prompt word construction is mostly based on static templates, lacking dynamic optimization capabilities based on search results and mechanisms for adaptive adjustments based on user feedback, thus limiting further improvements in overall system performance.
Summary of the Invention
[0007] The main objective of this application is to provide an AI question-answering accuracy optimization method, apparatus, device, and storage medium that solve the following problems in existing intelligent question-answering systems: the lack of a unified semantic alignment mechanism, answers that are inaccurate or inconsistent with the original data even when the retrieved material is relevant, and the difficulty of accurately selecting high-quality answers.
[0008] To achieve the above objectives, this application provides the following technical solution.
[0009] An AI question-answering accuracy optimization method includes the following steps.
[0010] Step S1: Semantic units are segmented from the multi-source heterogeneous data, and vectorized and encoded using Sentence-BERT, TaBERT, and CLIP encoders respectively to obtain a set of multi-source semantic vectors.
[0011] Step S2: The multi-source semantic vector set is constructed into a hierarchical semantic index tree structure through two-layer K-means clustering.
[0012] Step S3: In response to an external user's question, the hierarchical semantic index tree structure is combined with the user's question, and intent recognition and entity extraction are performed through the RoBERTa model to obtain a structured query triple.
[0013] Step S4: Fill the structured query triples into the multi-type prompt template to generate candidate prompt words, and perform contrastive learning encoding through SimCSE-RoBERTa to obtain a candidate prompt word encoding set.
[0014] Step S5: Perform dual semantic alignment on the candidate prompt word encoding set to calculate multidimensional semantic similarity and credibility scores, and obtain a comprehensive score candidate answer set.
[0015] Step S6: Sort and filter the comprehensive scoring candidate answer set to reconstruct the optimal prompt words and input them into the large language model to obtain the optimized AI question answering results.
[0016] Beneficial effects of steps S1 to S6: Through steps S1 to S6, this method forms a complete technical closed loop of multi-source data semantic alignment and AI question answering optimization. It effectively overcomes the technical problems of traditional RAG systems in multi-source heterogeneous data scenarios: inconsistent semantic representation, low retrieval accuracy, lack of deep semantic consistency verification, and a tendency to generate AI hallucinations. In step S1, the method segments multi-source heterogeneous data into semantic units and vectorizes them with Sentence-BERT, TaBERT, and CLIP encoders to construct a unified semantic vector space, addressing the cross-modal semantic gap between text, tables, and images. In step S2, a hierarchical semantic index tree structure is built using two-layer K-means clustering, providing a data foundation for efficient retrieval of large-scale semantic vectors. Step S3 responds to user queries, using the RoBERTa model for intent recognition and entity extraction and transforming natural language queries into structured query triples for accurate understanding of user needs. Step S4 fills the structured triples into multi-type prompt templates to generate candidate prompt words, and enhances semantic alignment capabilities through SimCSE-RoBERTa contrastive learning. Step S5 ensures consistency between candidate answers and both user intent and data source through a dual semantic alignment mechanism, combined with credibility scoring to filter high-quality candidate answers. Finally, step S6 reconstructs the optimal prompt words and drives a large language model to generate accurate and reliable answers, achieving end-to-end optimization from data collection, index construction, intent understanding, prompt word generation, and quality filtering to the final answer.
[0017] In step S1, text is segmented into units using a sliding window and semantic coherence detection, tables are segmented into key-value structures by row and logical partition, and visual element recognition generates structured descriptions of images. The three kinds of semantic blocks are encoded and mapped to a unified 768-dimensional space, eliminating modal differences. Step S2 removes the influence of vector magnitude through L2 norm normalization, determines the number of cluster centers by taking the square root of the total number of vectors, and constructs a two-layer index structure from coarse clusters to sub-clusters. During retrieval, cosine similarity is matched layer by layer, reducing the retrieval complexity from O(N) to O(log N). Step S3 predicts the intent category and detects entity boundaries using the RoBERTa sequence labeling branches, generates triples, and simultaneously matches the user question vector against the index tree layer by layer to obtain Top-K candidate semantic blocks for verification. Step S4 generates initial candidate prompt words using three types of prompt templates (text, table, and image) and then uses the SimCSE-RoBERTa InfoNCE loss layer for contrastive learning optimization. Step S5 calculates the semantic similarity between the candidate answer and the user's question, as well as the consistency score with the data source, and integrates credibility indicators such as degree of structuring, update timeliness, and source authority. Step S6 sorts the candidates by comprehensive score, selects the best candidate, reconstructs the prompt words, and explicitly injects the original data fragments to guide the large language model to generate accurate answers that are highly consistent with the data source.
[0018] As a further improvement of this application, step S1, in which the multi-source heterogeneous data is segmented into semantic units and vectorized using Sentence-BERT, TaBERT, and CLIP encoders to obtain a multi-source semantic vector set, includes the following sub-steps.
[0019] Step S1.1: Collect multi-source heterogeneous raw data and perform format normalization processing to obtain a standardized multi-source dataset.
[0020] Step S1.2: Modal splitting is performed on the standardized multi-source dataset to obtain text dataset, table dataset, and image dataset.
[0021] Step S1.3: The text dataset is segmented into units by using a sliding window combined with semantic coherence detection to obtain a set of text semantic blocks.
[0022] Step S1.4: Perform key-value structured segmentation on the table dataset by row and logical partition to obtain a set of table semantic blocks.
[0023] Step S1.5: Perform visual element recognition and structured description generation on the image dataset to obtain an image semantic block set.
[0024] Step S1.6: The set of text semantic blocks is vectorized and encoded using the Sentence-BERT pre-trained model to obtain a set of text semantic vectors.
[0025] Step S1.7: The set of table semantic blocks is vectorized and encoded using a TaBERT pre-trained model to obtain a set of table semantic vectors.
[0026] Step S1.8: The image semantic block set is vectorized and encoded using the CLIP visual encoder to obtain an image semantic vector set.
[0027] Step S1.9: Merge the text semantic vector set, the table semantic vector set, and the image semantic vector set to obtain a multi-source semantic vector set.
[0028] Beneficial effects of steps S1.1 to S1.9: This series of steps, through semantic unit segmentation and multimodal coding, realizes a unified semantic representation of multi-source heterogeneous data, solves the problem of cross-modal semantic gap between text, tables, and images, and lays a data foundation for subsequent hierarchical index construction and cross-modal retrieval.
[0029] The process includes the following steps: Step S1.1: Data acquisition and format normalization to eliminate differences in data sources; Step S1.2: Modal segmentation to provide a basis for differentiated processing; Step S1.3: Accurate segmentation of text semantic blocks through sliding window and semantic coherence detection to ensure semantic integrity; Step S1.4: Preservation of table structure information through key-value structured segmentation to maintain business semantic units; Step S1.5: Structured description of image semantics through visual element recognition to convert unstructured images into searchable text; Step S1.6: Text semantic vectorization through Sentence-BERT encoding; Step S1.7: Table semantic vectorization through TaBERT encoding; Step S1.8: Image semantic vectorization through CLIP encoding; and Step S1.9: Merging of the three types of semantic vectors to construct a unified multi-source semantic vector space.
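Steps S1.3 and S1.4 can be illustrated with a minimal sketch. The application does not specify the window size, the coherence measure, or the logical partitioning rule, so everything below is an assumption: `bow_cosine` is a toy bag-of-words stand-in for a real semantic coherence model, `max_len` and `threshold` are hypothetical parameters, and `segment_table` covers only the per-row key-value split.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of word counts (toy stand-in for a semantic model)."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def segment_text(sentences, max_len=3, threshold=0.2):
    """Move a bounded window over the sentences; cut a block when the
    window is full or the next sentence is not coherent with it."""
    blocks, current = [], []
    for sent in sentences:
        full = len(current) >= max_len
        if current and (full or bow_cosine(" ".join(current), sent) < threshold):
            blocks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        blocks.append(" ".join(current))
    return blocks


def segment_table(header, rows):
    """Turn each table row into one key-value semantic block."""
    return [dict(zip(header, row)) for row in rows]


sentences = [
    "The pump pressure is 3 bar.",
    "The pump pressure limit is 5 bar.",
    "Invoices are paid monthly.",
]
text_blocks = segment_text(sentences)
table_blocks = segment_table(["device", "status"], [["pump-1", "ok"], ["pump-2", "fault"]])
```

In this toy input the two pump sentences share most of their words and stay in one block, while the invoice sentence starts a new block. In a real system the resulting blocks would then be passed to the Sentence-BERT and TaBERT encoders named in steps S1.6 and S1.7.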
[0030] As a further improvement of this application, step S2, in which the multi-source semantic vector set is constructed into a hierarchical semantic index tree structure through two-layer K-means clustering, includes the following sub-steps.
[0031] Step S2.1: Perform L2 norm global normalization on the multi-source semantic vector set to obtain a normalized multi-source semantic vector set.
[0032] Step S2.2: Count the total number of vectors in the normalized multi-source semantic vector set, and calculate the number of first-layer cluster centers by taking the arithmetic square root of the total number of vectors.
[0033] Step S2.3: Input the normalized multi-source semantic vector set and the number of cluster centers in the first layer into the K-means clustering algorithm to obtain the first layer cluster center vector set and the first layer intra-cluster vector grouping set.
[0034] Step S2.4: Count the number of vectors in each group of the first-layer intra-cluster vector grouping set, and calculate the number of second-layer cluster centers for each group by taking the arithmetic square root of that group's vector count.
[0035] Step S2.5: Input the first-layer intra-cluster vector grouping set and the number of second-layer cluster centers into the K-means clustering algorithm to obtain the second-layer cluster center vector set and the second-layer intra-cluster vector grouping set.
[0036] Step S2.6: Perform hierarchical association binding between the first layer cluster center vector set and the second layer cluster center vector set to obtain a hierarchical cluster center association structure.
[0037] Step S2.7: Bind the second-layer intra-cluster vector grouping set to the hierarchical cluster center association structure using vector pointers to obtain a set of hierarchical tree nodes with vector indexes.
[0038] Step S2.8: Perform topological sorting and structural solidification on the hierarchical tree node set to obtain the hierarchical semantic index tree structure.
[0039] Beneficial effects of steps S2.1 to S2.8: This series of steps constructs a hierarchical semantic index tree through two-layer K-means clustering, realizing efficient organization and fast retrieval of large-scale semantic vectors, reducing the retrieval time complexity from O(N) to O(log N).
[0040] The process involves the following steps: Step S2.1 performs L2 norm normalization to eliminate the influence of vector magnitude on distance calculation; Step S2.2 determines the number of cluster centers in the first layer by taking the square root of the total number of vectors, achieving automatic adaptation of the cluster size; Step S2.3 completes the first layer of K-means clustering to generate coarse-grained cluster partitions; Step S2.4 determines the number of cluster centers in the second layer by taking the square root of the number of grouping vectors, achieving adaptive fine-grained cluster size; Step S2.5 completes the second layer of K-means clustering to generate fine-grained cluster partitions; Step S2.6 completes hierarchical association binding to establish the parent-child relationship between the two layers of cluster centers; Step S2.7 completes vector pointer binding to establish the mapping between nodes and original semantic blocks; and Step S2.8 completes topological sorting and structure solidification to generate a persistently stored hierarchical semantic index tree structure.
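The construction in steps S2.1 to S2.7 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the K-means routine is ordinary Lloyd iteration with a deterministic, evenly spaced initialization, the square root is rounded to the nearest integer (the application does not state a rounding rule), and the node layout with `center`, `children`, and `vector_ids` keys is hypothetical. The topological sorting and structural solidification of step S2.8 are not shown.

```python
import math


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def kmeans(vectors, k, iters=10):
    """Plain Lloyd's K-means; returns (centers, groups of vector indices)."""
    step = len(vectors) / k
    centers = [list(vectors[int(i * step)]) for i in range(k)]  # deterministic init
    dim = len(vectors[0])
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for idx, v in enumerate(vectors):
            best = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            groups[best].append(idx)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]
    return centers, groups


def build_index_tree(raw_vectors):
    vectors = [l2_normalize(v) for v in raw_vectors]        # S2.1
    k1 = max(1, round(math.sqrt(len(vectors))))             # S2.2 (rounding is an assumption)
    centers1, groups1 = kmeans(vectors, k1)                 # S2.3
    tree = []
    for center, members in zip(centers1, groups1):
        if not members:
            continue
        k2 = max(1, round(math.sqrt(len(members))))         # S2.4
        centers2, groups2 = kmeans([vectors[i] for i in members], k2)  # S2.5
        children = [
            {"center": c2, "vector_ids": [members[j] for j in g2]}     # S2.7: pointers to blocks
            for c2, g2 in zip(centers2, groups2) if g2
        ]
        tree.append({"center": center, "children": children})          # S2.6: parent-child link
    return tree


# Toy data: 16 two-dimensional vectors in four well-separated directions.
raw = []
for bx, by in [(1, 0), (0, 1), (-1, 0), (0, -1)]:
    for d in (-0.2, -0.1, 0.1, 0.2):
        raw.append([bx * 2 + by * d, by * 2 + bx * d])
tree = build_index_tree(raw)
```

For these 16 vectors the first layer uses 4 cluster centers, and each first-layer group of 4 vectors is split again into 2 second-layer clusters. A production system would more likely use a library K-means on 768-dimensional embeddings; the pure-Python version is only meant to make the two-layer structure visible.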
[0041] As a further improvement of this application, step S3, in which, in response to an external user's question, the hierarchical semantic index tree structure is combined with the user's question and intent recognition and entity extraction are performed through the RoBERTa model to obtain a structured query triple, includes the following sub-steps.
[0042] Step S3.1: In response to an external user's question, perform text standardization and noise character filtering on the externally input user question to obtain the standardized user question text.
[0043] Step S3.2: Semantically encode the standardized user question text using the RoBERTa pre-trained model to obtain the user question semantic vector.
[0044] Step S3.3: Perform layer-by-layer cosine similarity matching between the user question semantic vector and the hierarchical semantic index tree structure to obtain a Top-K candidate semantic block set.
[0045] Step S3.4: Input the standardized user question text into the RoBERTa sequence labeling branch to perform intent category label prediction and obtain the user question intent label.
[0046] Step S3.5: Input the standardized user question text into the RoBERTa entity recognition branch to perform named entity boundary detection and type labeling to obtain the user question entity set.
[0047] Step S3.6: Perform field alignment mapping between the user question intent label and the user question entity set to obtain a set of intent-entity association pairs.
[0048] Step S3.7: Perform semantic association verification between the intent-entity association pair set and the Top-K candidate semantic block set to obtain the verified intent-entity association pair set.
[0049] Step S3.8: Perform structured recombination on the verified intent-entity association pair set to obtain a structured query triple.
[0050] Beneficial effects of steps S3.1 to S3.8: This series of steps, through intent recognition and entity extraction, realizes the transformation of users' natural language questions into structured queries, accurately understands user needs, and provides accurate semantic expression for subsequent prompt word generation.
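The layer-by-layer cosine matching of step S3.3 can be sketched as below. The application does not say how many branches are followed at each layer, so this sketch assumes a greedy descent into the single best first-layer and second-layer center. The node layout (`center`, `children`, `vector_ids`) and the function name `search_tree` are hypothetical, and the small tree is written by hand to keep the example self-contained.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_tree(tree, vectors, query, top_k=2):
    """Pick the most similar first-layer center, then the most similar
    second-layer center below it, then rank that leaf's vectors."""
    coarse = max(tree, key=lambda node: cosine(query, node["center"]))
    fine = max(coarse["children"], key=lambda node: cosine(query, node["center"]))
    ranked = sorted(fine["vector_ids"], key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:top_k]


# Hand-built toy index over five 2-D vectors (ids 0..4).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
tree = [
    {"center": [0.95, 0.05], "children": [{"center": [0.95, 0.05], "vector_ids": [0, 1]}]},
    {"center": [0.05, 0.95], "children": [{"center": [0.05, 0.95], "vector_ids": [2, 3]}]},
    {"center": [-1.0, 0.0], "children": [{"center": [-1.0, 0.0], "vector_ids": [4]}]},
]
top_blocks = search_tree(tree, vectors, query=[0.2, 1.0], top_k=2)
```

Only the centers on the chosen path and the vectors in one leaf are compared with the query, which is where the saving over a full scan comes from. Following a single branch can miss neighbors that fall just across a cluster boundary; probing several branches per layer is a common mitigation, though the application does not describe one.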
[0051] The process includes the following steps: Step S3.1: Text standardization and noise filtering to improve subsequent encoding quality; Step S3.2: Generating user query semantic vectors through RoBERTa encoding to provide vector representation for similarity matching; Step S3.3: Rapidly locating Top-K candidate semantic blocks through layer-wise cosine similarity matching to provide relevance for intent verification; Step S3.4: Predicting intent category labels through the RoBERTa sequence labeling branch to achieve automatic classification of user needs; Step S3.5: Detecting named entity boundaries through the RoBERTa entity recognition branch to extract key query elements; Step S3.6: Completing field alignment mapping between intent labels and entity sets to generate intent-entity association pairs; Step S3.7: Ensuring the accuracy and relevance of the extracted results through semantic association verification and filtering out misidentified results; and Step S3.8: Completing structured recombination to generate (entity, intent, attribute) triples to provide structured input for prompt word generation.
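Steps S3.6 to S3.8 can be illustrated with a small sketch that assembles (entity, intent, attribute) triples. The intent label and entity list are hard-coded here as hypothetical outputs of the RoBERTa branches, and the "semantic association verification" of step S3.7 is reduced to a simple substring check against the retrieved blocks, which is an assumption made only for illustration.

```python
from typing import NamedTuple


class QueryTriple(NamedTuple):
    entity: str
    intent: str
    attribute: str


def build_triples(intent, entities, candidate_blocks):
    """Pair the intent with each entity and attribute (S3.6), keep only
    entities supported by a retrieved block (S3.7), emit triples (S3.8)."""
    corpus = " ".join(candidate_blocks).lower()
    attributes = [e["text"] for e in entities if e["type"] == "ATTRIBUTE"]
    triples = []
    for ent in entities:
        if ent["type"] != "ENTITY":
            continue
        if ent["text"].lower() not in corpus:  # no support in the Top-K blocks
            continue
        for attr in attributes:
            triples.append(QueryTriple(ent["text"], intent, attr))
    return triples


# Hypothetical model outputs for the question "What is the status of pump-2 and pump-9?"
intent = "lookup_value"
entities = [
    {"text": "pump-2", "type": "ENTITY"},
    {"text": "pump-9", "type": "ENTITY"},
    {"text": "status", "type": "ATTRIBUTE"},
]
top_k_blocks = ["device: pump-2, status: fault"]
triples = build_triples(intent, entities, top_k_blocks)
```

Here `pump-9` is dropped because no retrieved block mentions it, which mirrors the role the application assigns to step S3.7: filtering out misidentified results before the triple is built.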
[0052] As a further improvement of this application, step S4, in which the structured query triples are filled into a multi-type prompt template to generate candidate prompt words and contrastive learning encoding is performed through SimCSE-RoBERTa to obtain a candidate prompt word encoding set, includes the following sub-steps.
[0053] Step S4.1: Split the structured query triples into entity fields, intent fields, and attribute fields.
[0054] Step S4.2: Fill the entity field, intent field, and attribute field into the predefined text, table, and image prompt templates respectively to obtain the initial candidate prompt word set.
[0055] Step S4.3: Standardize the text format and filter special characters of the initial candidate prompt word set to obtain a standardized candidate prompt word set.
[0056] Step S4.4: The standardized candidate prompt word set is initially semantically encoded using the SimCSE-RoBERTa pre-trained model to obtain an initial prompt word vector set.
[0057] Step S4.5: Construct semantically similar positive sample pairs and a global negative sample queue for the initial prompt word vector set to obtain a contrastive learning sample set.
[0058] Step S4.6: Input the contrastive learning sample set into the InfoNCE loss layer of SimCSE-RoBERTa to perform vector space alignment optimization, and obtain the optimized prompt word vector set.
[0059] Step S4.7: Perform L2 norm global normalization on the optimized prompt word vector set to obtain a normalized prompt word vector set.
[0060] Step S4.8: Bind the normalized prompt word vector set to the standardized candidate prompt word set one by one to obtain the candidate prompt word encoding set.
[0061] Beneficial effects of steps S4.1 to S4.8: This series of steps, through multi-type prompt templates and contrastive learning encoding, realizes the dynamic generation and semantic alignment optimization of candidate prompt words, and enhances the semantic matching ability between prompt words and user queries.
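The template filling in steps S4.1 to S4.3 might look like the following sketch. The template wording, the `TEMPLATES` dictionary, and the function name `fill_templates` are all hypothetical; the application only states that predefined text, table, and image templates exist.

```python
# Hypothetical prompt templates, one per data-source type.
TEMPLATES = {
    "text": "According to the documents, answer the {intent} question about the {attribute} of {entity}.",
    "table": "From the table rows for {entity}, report the {attribute} ({intent}).",
    "image": "Using the image descriptions that mention {entity}, describe its {attribute} ({intent}).",
}


def fill_templates(triple):
    """Split the triple into fields (S4.1), fill every template (S4.2),
    and collapse stray whitespace (a minimal form of S4.3)."""
    entity, intent, attribute = triple
    filled = {
        kind: template.format(entity=entity, intent=intent, attribute=attribute)
        for kind, template in TEMPLATES.items()
    }
    return {kind: " ".join(text.split()) for kind, text in filled.items()}


prompts = fill_templates(("pump-2", "lookup_value", "status"))
```

Each triple yields one candidate prompt per template type, so a single user question produces candidates aimed at text, table, and image sources.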
[0062] The process includes the following steps: Step S4.1 splits the triple fields to provide structured input for template filling; Step S4.2 generates diverse candidate prompt words using predefined multi-type templates to cover query scenarios from different data sources; Step S4.3 standardizes text format and filters special characters to improve subsequent encoding quality; Step S4.4 generates initial prompt word vectors using SimCSE-RoBERTa encoding to provide a basic representation for contrastive learning optimization; Step S4.5 establishes training objectives for contrastive learning by constructing positive and negative sample pairs to enhance semantic discrimination capabilities; Step S4.6 optimizes the vector space distribution using the InfoNCE loss layer to cluster semantically similar prompt words in the vector space; Step S4.7 performs L2 norm normalization to unify vector magnitudes for easier subsequent similarity calculations; and Step S4.8 binds vectors to text to establish a candidate prompt word encoding set.
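The InfoNCE objective referred to in step S4.6 can be written out for a single anchor as a short sketch. It computes only the loss value for fixed vectors; the gradient update of the encoder is not shown. The temperature value and the toy vectors are assumptions, and the negatives list stands in for the global negative sample queue of step S4.5.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE for one anchor: -log( exp(s_pos/t) / sum_i exp(s_i/t) ),
    where the sum runs over the positive and all negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


anchor = [1.0, 0.0]
# Positive close to the anchor, negatives far away: low loss.
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive far from the anchor, a negative close to it: high loss.
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Minimizing this loss pulls the anchor toward its positive and pushes it away from the negatives, which matches the stated goal of clustering semantically similar prompt words in the vector space.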
[0063] As a further improvement of this application, step S5, in which dual semantic alignment is performed on the candidate prompt word encoding set to calculate multidimensional semantic similarity and credibility scores and obtain a comprehensive score candidate answer set, includes the following sub-steps.
[0064] Step S5.1: Input the standardized candidate prompt words corresponding to the candidate prompt word encoding set into the large language model to trigger the retrieval of associated semantic blocks and the generation of answers, thereby obtaining the initial candidate answer set and the associated data source semantic block set.
[0065] Step S5.2: Semantically encode the initial candidate answer set and the associated data source semantic block set to obtain the candidate answer semantic vector set and the data source semantic vector set.
[0066] Step S5.3: Calculate the cosine similarity between the candidate answer semantic vector set and the data source semantic vector set to obtain a semantic similarity score set.
[0067] Step S5.4: Perform statistical calculations on the structuring degree, update timeliness, and source authority of the semantic block set of the associated data source to obtain a set of data source credibility scores.
[0068] Step S5.5: Perform linear weighted fusion of the semantic similarity score set and the data source credibility score set to obtain the comprehensive score set of candidate answers.
[0069] Step S5.6: Bind the comprehensive score set of candidate answers to the initial candidate answer set one by one to obtain the comprehensive score candidate answer set.
[0070] Beneficial effects of steps S5.1 to S5.6: This series of steps achieves multi-dimensional quality assessment of candidate answers through dual semantic alignment and credibility scoring, ensuring that the answer content is consistent with both the user's intent and the original data, and effectively filtering high-quality answers.
[0071] The process includes the following steps: Step S5.1 generates candidate answers and associates them with the data source, establishing a source relationship between the answers and the original data; Step S5.2 generates answer vectors and data source vectors through semantic encoding, providing vector representation for similarity calculation; Step S5.3 calculates the cosine similarity between candidate answers and the data source, quantifying the degree of consistency between the answers and the original data; Step S5.4 evaluates the structured nature, update timeliness, and source authority of the data source through statistical calculations, generating a credibility score; Step S5.5 performs a linear weighted fusion of semantic similarity and credibility to achieve a multi-dimensional comprehensive score; and Step S5.6 binds the comprehensive score to the candidate answers, constructing a sortable and filterable set of comprehensive score candidate answers.
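The scoring in steps S5.3 to S5.5 can be sketched as a linear weighted fusion. The application gives no weights or indicator formulas, so the equal-weight credibility average, the linear timeliness decay over `max_age_days`, and the `(0.3, 0.4, 0.3)` fusion weights are all assumptions. The sketch scores both alignments mentioned in the description: answer versus user question and answer versus data source.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def credibility(structuring, age_days, authority, max_age_days=365):
    """Average of structuring degree, update timeliness, and source authority,
    each on a 0..1 scale (S5.4; the equal weighting is an assumption)."""
    timeliness = max(0.0, 1.0 - age_days / max_age_days)
    return (structuring + timeliness + authority) / 3


def composite_score(sim_question, sim_source, cred, weights=(0.3, 0.4, 0.3)):
    """Linear weighted fusion of the two similarities and credibility (S5.5)."""
    w_q, w_s, w_c = weights
    return w_q * sim_question + w_s * sim_source + w_c * cred


question_vec = [1.0, 0.0]
candidates = {
    "A": {"answer": [1.0, 0.0], "source": [1.0, 0.0], "cred": credibility(1.0, 0, 1.0)},
    "B": {"answer": [1.0, 1.0], "source": [0.0, 1.0], "cred": credibility(0.5, 182.5, 0.5)},
}
scores = {
    name: composite_score(
        cosine(c["answer"], question_vec),  # alignment with the user question
        cosine(c["answer"], c["source"]),   # alignment with the data source (S5.3)
        c["cred"],
    )
    for name, c in candidates.items()
}
```

Candidate A matches both the question and its source and comes from a fresh, authoritative, structured source, so it scores about 1.0; candidate B scores about 0.645.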
[0072] As a further improvement of this application, step S6, in which the comprehensive score candidate answer set is sorted and filtered to reconstruct the optimal prompt words, which are input into a large language model to obtain the optimized AI question-answering result, includes the following sub-steps.
[0073] Step S6.1: Sort the candidate answer set according to the comprehensive score from high to low to obtain the sorted candidate answer set.
[0074] Step S6.2: Extract the semantic block of the associated data source corresponding to the optimal candidate answer in the sorted candidate answer set to obtain the optimal data source semantic block.
[0075] Step S6.3: Concatenate the semantic block of the optimal data source with the structured query triple to obtain the basic content of the reconstructed prompt words.
[0076] Step S6.4: Standardize the format and filter redundant content of the basic content of the reconstruction prompt words to obtain standardized reconstruction prompt words.
[0077] Step S6.5: Input the standardized reconstructed prompt words into the context input layer of the large language model, and perform attention mask generation and position encoding to obtain the final input sequence of the model.
[0078] Step S6.6: Input the final input sequence of the model into the large language model for autoregressive generation to obtain the initial generated response text.
[0079] Step S6.7: The format of the initially generated answer text is standardized and special characters are cleaned up to obtain the optimized AI question-and-answer result.
[0080] Beneficial effects of steps S6.1 to S6.7: This series of steps, through sorting, filtering and prompt word reconstruction, achieves accurate extraction of the optimal answer and explicit injection of raw data, guiding the large language model to generate accurate answers that are highly consistent with the data source, thus curbing the generation of AI hallucinations at the source.
[0081] The process involves the following steps: Step S6.1 sorting candidate answers in descending order to quickly locate the optimal answer; Step S6.2 extracting semantic blocks from the associated data sources of the optimal candidate answer to provide raw data fragments for prompt word reconstruction; Step S6.3 concatenating the semantic blocks from the data sources with structured triples to build the content foundation for reconstructing the prompt words; Step S6.4 standardizing the format and filtering redundancy to improve the standardization and information density of the prompt words; Step S6.5 generating attention masks and positional encoding to adapt to the input format requirements of the large language model; Step S6.6 generating accurate answers through autoregressive generation using the large language model; and Step S6.7 format regularization and special character cleanup to output the final optimized AI question-answering result.
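Steps S6.1 to S6.4 can be sketched as follows. The prompt wording, the dictionary keys `score` and `source_block`, and the function name `rebuild_prompt` are hypothetical, and the large language model call of steps S6.5 to S6.7 is left out because the application does not name a specific model interface.

```python
def rebuild_prompt(scored_candidates, triple):
    """Sort by composite score (S6.1), take the best candidate's source block
    (S6.2), join it with the query triple (S6.3), and tidy whitespace (S6.4)."""
    ranked = sorted(scored_candidates, key=lambda c: c["score"], reverse=True)
    best = ranked[0]
    entity, intent, attribute = triple
    prompt = (
        f"Source data: {best['source_block']}\n"
        f"Query: entity={entity}; intent={intent}; attribute={attribute}\n"
        "Answer using only the source data above."
    )
    return "\n".join(" ".join(line.split()) for line in prompt.splitlines())


scored = [
    {"answer": "pump-2 is running", "score": 0.41, "source_block": "device: pump-1, status: ok"},
    {"answer": "pump-2 has a fault", "score": 0.93, "source_block": "device: pump-2,   status: fault"},
]
final_prompt = rebuild_prompt(scored, ("pump-2", "lookup_value", "status"))
```

The rebuilt prompt carries the original data fragment verbatim, so the model is asked to answer from the source text rather than from its parametric knowledge, which is the explicit injection of raw data the description refers to.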
[0082] To achieve the above objectives, this application also provides the following technical solutions.
[0083] An AI question-answering accuracy optimization device is provided, which is applied to the AI question-answering accuracy optimization method described above, and the AI question-answering accuracy optimization device includes the following modules.
[0084] The multi-source heterogeneous data segmentation module is used to segment multi-source heterogeneous data into semantic units, and then perform vectorization encoding through Sentence-BERT, TaBERT, and CLIP encoders respectively to obtain a multi-source semantic vector set.
[0085] The hierarchical semantic index tree structure construction module is used to construct a hierarchical semantic index tree structure from the multi-source semantic vector set through two-layer K-means clustering.
[0086] The user query identification and extraction module is used to respond to external user queries by combining the hierarchical semantic index tree structure with the user query and performing intent identification and entity extraction through the RoBERTa model to obtain structured query triples.
[0087] The candidate prompt word encoding set generation module is used to fill the structured query triples into a multi-type prompt template to generate candidate prompt words, and to perform contrastive learning encoding through SimCSE-RoBERTa to obtain the candidate prompt word encoding set.
[0088] The candidate prompt word encoding semantic alignment module is used to perform dual semantic alignment on the candidate prompt word encoding set to calculate multi-dimensional semantic similarity and credibility scores, and obtain a comprehensive score candidate answer set.
[0089] The AI question-answering result optimization module is used to sort and filter the comprehensive score candidate answer set, reconstruct the optimal prompt words and input them into the large language model to obtain the optimized AI question-answering result.
[0090] To achieve the above objectives, this application also provides the following technical solutions.
[0091] An electronic device includes a processor and a memory coupled to the processor, the memory storing program instructions executable by the processor; when the processor executes the program instructions stored in the memory, it implements the AI question-answering accuracy optimization method described above.
[0092] To achieve the above objectives, this application also provides the following technical solutions.
[0093] A computer-readable storage medium stores program instructions that, when executed by a processor, implement the AI question-answering accuracy optimization method described above.
Attached Figure Description
[0094] Figure 1 is a flowchart illustrating the steps of an embodiment of an AI question-answering accuracy optimization method according to this application.
[0095] Figure 2 is a schematic diagram of the functional modules of an embodiment of an AI question-answering accuracy optimization device according to this application.
[0096] Figure 3 is a schematic diagram of the structure of an embodiment of the electronic device of this application.
[0097] Figure 4 is a schematic diagram of the structure of an embodiment of the storage medium of this application.
Detailed Implementation
[0098] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0099] The terms "first," "second," and "third" in this application are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of that feature. In the description of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified. All directional indications (such as up, down, left, right, front, back, etc.) in the embodiments of this application are only used to explain the relative positional relationships and movements between components in a specific orientation (e.g., as shown in the figures). If the specific orientation changes, the directional indications also change accordingly. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or devices.
[0100] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
[0101] It should be noted that, due to the limited types and number of symbols or letters that can represent specific meanings, for embodiments with many formulas or codes, there may be situations where symbols or letters cannot meet the usage requirements. Therefore, the interpretation of formula symbols in the steps or sub-steps of the embodiments is only valid for the current step or sub-step.
[0102] If the same symbol has different interpretations in different steps or sub-steps, the interpretation in the current step or sub-step shall prevail; if the same symbol appears in different steps or sub-steps, but no interpretation is given in subsequent steps or sub-steps after its first appearance, the interpretation in the first step or sub-step shall be used.
[0103] For example, "i=1,2,...,n" is a customary notation with a well-known meaning. If "i=1,2,...,n" has already appeared once in an embodiment and needs to be used again in later formulas with different scenarios and meanings, its meaning should be interpreted according to the corresponding formula. Replacing it after its first use with a symbol or letter that has no well-known meaning, such as "p=1,2,...,q", merely to avoid repetition would be more likely to cause confusion, ambiguity, and lack of clarity.
[0104] As shown in Figure 1, this embodiment provides an example of an AI question-answering accuracy optimization method, which includes the following steps.
[0105] Step S1: Semantic units are segmented from the multi-source heterogeneous data, and vectorized and encoded using Sentence-BERT, TaBERT, and CLIP encoders respectively to obtain a set of multi-source semantic vectors.
[0106] Furthermore, step S1 specifically includes the following steps.
[0107] Step S1.1: Collect multi-source heterogeneous raw data and perform format normalization processing to obtain a standardized multi-source dataset.
[0108] Preferably, the core of this step is to complete the collection and format unification of multi-source heterogeneous raw data, and to process data from different sources through preset normalization rules to achieve data format standardization.
[0109] The multi-source heterogeneous raw data includes Word documents, PDF documents, Excel spreadsheets, charts, and images from within the enterprise, and the data sources cover channels such as local file systems, relational databases, and knowledge bases.
[0110] The format normalization process includes converting all file encodings to UTF-8, formatting all timestamps as YYYY-MM-DD HH:MM:SS, rounding all numerical values to two decimal places, and filling missing fields with NULL markers.
[0111] Preferably, file encoding detection uses the chardet library for automatic identification; timestamp parsing supports multiple input formats; and numerical precision processing uses a rounding strategy.
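The normalization rules of step S1.1 can be illustrated with a minimal standard-library sketch. The list of accepted timestamp layouts, the half-up rounding mode, and all function names are assumptions made for illustration; the source names the chardet library for encoding detection, which is left out here so that the example stays self-contained (the source encoding is passed in explicitly).

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical list of accepted input layouts; the source only states that
# multiple input formats are supported.
TIMESTAMP_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H:%M",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
)


def to_utf8(raw_bytes, source_encoding):
    """Re-encode bytes from an already detected source encoding to UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def normalize_timestamp(raw):
    """Return the timestamp as YYYY-MM-DD HH:MM:SS, or the NULL marker."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        return parsed.strftime("%Y-%m-%d %H:%M:%S")
    return "NULL"


def normalize_number(raw):
    """Keep two decimal places; half-up rounding is an assumption."""
    value = Decimal(str(raw))
    return str(value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def fill_missing(record, field_names):
    """Fill absent or empty fields with the NULL marker."""
    return {
        name: ("NULL" if record.get(name) in (None, "") else record[name])
        for name in field_names
    }
```

Going through `Decimal(str(raw))` avoids binary floating-point surprises when rounding values such as 2.675.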
[0112] Step S1.2: Modal splitting is performed on the standardized multi-source dataset to obtain text dataset, table dataset, and image dataset.
[0113] Preferably, the core of this step is to complete the modality classification of the data, and to split the multi-source data into three modalities: text, table, and image by identifying file extensions and parsing content.
[0114] For the modality splitting rules, the text modality includes plain text or document formats such as .txt, .doc, .docx, and .pdf; the table modality includes table formats such as .xls, .xlsx, and .csv; and the image modality includes image formats such as .jpg, .png, and .bmp.
[0115] For boundary cases, PDF documents are first treated as images and are converted to the text modality if extractable text content is detected; tables nested in documents are extracted separately as the table modality.
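As a rough illustration of the extension-based rules in step S1.2, the following sketch classifies a file name into a modality. The function name and the `has_extractable_text` flag are hypothetical; how extractable text is detected in a PDF is not specified in the source and is left to the caller.

```python
import os

TEXT_EXTENSIONS = {".txt", ".doc", ".docx", ".pdf"}
TABLE_EXTENSIONS = {".xls", ".xlsx", ".csv"}
IMAGE_EXTENSIONS = {".jpg", ".png", ".bmp"}


def classify_modality(filename, has_extractable_text=False):
    """Map a file name to "text", "table", "image", or "unknown"."""
    extension = os.path.splitext(filename)[1].lower()
    if extension == ".pdf":
        # Boundary case: a PDF starts as an image and becomes text only
        # when extractable text content has been detected.
        return "text" if has_extractable_text else "image"
    if extension in TEXT_EXTENSIONS:
        return "text"
    if extension in TABLE_EXTENSIONS:
        return "table"
    if extension in IMAGE_EXTENSIONS:
        return "image"
    return "unknown"
```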
[0116] Step S1.3: The text dataset is segmented into units by using a sliding window combined with semantic coherence detection to obtain a set of text semantic blocks.
[0117] Preferably, the core of this step is to complete the semantic unit segmentation of the text data, traverse the text through a sliding window, and calculate the semantic coherence score of adjacent windows in combination with a pre-trained language model to extract semantically complete text blocks.
[0118] Among them, the sliding window parameters are: window size of 256 words, step size of 64 words, and overlap rate between windows of 75%, to ensure accurate identification of semantic boundaries.
[0119] For semantic coherence detection, a pre-trained BERT model is used to calculate the cosine similarity of the [CLS] vectors of adjacent windows. The similarity calculation formula is as follows: $\mathrm{sim}(h_i, h_{i+1}) = \dfrac{h_i \cdot h_{i+1}}{\lVert h_i \rVert \, \lVert h_{i+1} \rVert}$.
[0120] Here, $h_i$ is the [CLS] vector of the i-th window. When the cosine similarity is below the threshold of 0.65, the position is marked as a semantic boundary.
[0121] In the construction of text semantic blocks, each semantic block retains complete sentence boundaries to avoid splitting in the middle of the sentence, and records the metadata of the semantic block, including the source file name, start position, end position, and word count.
[0122] For metadata, a JSON structure is used for storage, with one entry per semantic block.
[0123] Each entry records the source file name f, the starting character position, the terminating character position, and the number of characters l.
[0124] For example, a product instruction manual contains 10,240 words. After traversing through a sliding window, 15 locations with semantic coherence below the threshold were identified. These locations were then segmented into 16 semantic blocks, each containing between 500 and 800 words, with semantic boundaries located at the end of paragraphs.
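The window traversal and boundary test of step S1.3 can be sketched as follows. The window vectors are assumed to come from an encoder such as BERT and are passed in as plain lists of floats; the function names are illustrative, and appending a final window to cover the tail of the text is an assumption not stated in the source.

```python
import math


def sliding_windows(tokens, size=256, step=64):
    """Return (start, window) pairs; adjacent windows share size - step tokens."""
    last_start = max(len(tokens) - size, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # assumed: one extra window covers the tail
    return [(start, tokens[start:start + size]) for start in starts]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_boundaries(window_vectors, threshold=0.65):
    """Indices i where windows i and i+1 are less similar than the threshold."""
    return [
        i
        for i in range(len(window_vectors) - 1)
        if cosine_similarity(window_vectors[i], window_vectors[i + 1]) < threshold
    ]
```

With a window of 256 and a step of 64, each window shares 192 of its 256 tokens with the next one, which is the 75% overlap stated above.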
[0125] Step S1.4: Perform key-value structured segmentation on the table dataset by row and logical partition to obtain a set of table semantic blocks.
[0126] Preferably, the core of this step is to perform structured segmentation of the table data, converting the table into a key-value pair format of "column name: value", thus preserving the table's structural information and semantic integrity.
[0127] For row-by-row splitting, each row of data is converted into a sequence of key-value pairs in the "column name: value" format, for example "Product Name: XX Product; Sales Amount: 15000; Sales Quantity: 100", and the table header information is retained as the key names.
[0128] Specifically, for logical partitioning, the subtotal rows, total rows, and blank rows in the table are identified as partition boundaries, and the large table is split into multiple logical sub-tables, each of which corresponds to an independent business semantic unit.
[0129] For partition boundary detection, a row is treated as a partition boundary if any cell contains a keyword such as "subtotal", "total", or "summary", or if every cell in the row is empty.
[0130] For the table semantic block metadata, the source file name, worksheet name, row number range, column number range, and partition type are recorded to facilitate subsequent traceability.
[0131] For example, a sales data table contains sales records for 12 months. After logical partitioning, it is divided into 12 semantic blocks, each corresponding to the sales data of one month, such as "Product Name: XX Product; Sales Amount: 15000; Sales Quantity: 100".
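The row-wise conversion and logical partitioning of step S1.4 might look like the sketch below. It assumes the table has already been read into a header list and a list of rows; dropping the boundary rows themselves from the output blocks is an assumption, and the function names are illustrative.

```python
BOUNDARY_KEYWORDS = ("subtotal", "total", "summary")


def is_partition_boundary(row):
    """A row is a boundary if it is entirely empty or contains a keyword."""
    cells = [str(cell).strip() for cell in row]
    if all(cell == "" for cell in cells):
        return True
    return any(keyword in cell.lower() for cell in cells for keyword in BOUNDARY_KEYWORDS)


def row_to_key_values(header, row):
    """Render one row as "column name: value" pairs joined by semicolons."""
    return "; ".join(f"{name}: {value}" for name, value in zip(header, row))


def partition_table(header, rows):
    """Split rows into logical sub-tables at boundary rows."""
    blocks, current = [], []
    for row in rows:
        if is_partition_boundary(row):
            if current:
                blocks.append(current)
                current = []
            continue  # assumed: boundary rows are not kept in any block
        current.append(row_to_key_values(header, row))
    if current:
        blocks.append(current)
    return blocks
```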
[0132] Step S1.5: Perform visual element recognition and structured description generation on the image dataset to obtain an image semantic block set.
[0133] Preferably, the core of this step is to complete the structured semantic extraction of image data. Visual elements in the image are identified through an object detection model, and text content is extracted by combining optical character recognition to generate a structured semantic description.
[0134] For visual element recognition, a pre-trained object detection model (such as YOLO) is used to identify visual elements in the chart, such as the title area, legend area, coordinate axis area, and data area, and to obtain the bounding box coordinates of each element.
[0135] For optical character recognition, an OCR engine is used to extract the text content within each visual element area, including coordinate axis labels, legend text, data labels, etc.
[0136] For structured description generation, the visual element information and the OCR results are integrated into a natural language description built from the chart type t, the X-axis label, the Y-axis label, and a description of the data points.
[0137] Specifically, the image semantic block metadata records the source file name, image size, list of detected element types, and generation timestamp.
[0138] For example, a product parameter chart is a bar chart. After visual element recognition, the title area, X-axis (month), Y-axis (indicator), and data area are detected. After OCR extraction, a structured description is generated: "Bar chart: X-axis is month, Y-axis is indicator (currency unit), January value is 200, February value is 180, March value is 220".
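The description generation of step S1.5 can be sketched as simple template filling. The sketch assumes that object detection and OCR have already produced the chart type, axis labels, and data points; the function name and argument layout are hypothetical, and the wording of the template follows the example above.

```python
def describe_chart(chart_type, x_label, y_label, data_points):
    """Combine detected chart elements and OCR text into one description.

    data_points is a list of (category, value) pairs read from the data area.
    """
    values = ", ".join(f"{name} value is {value}" for name, value in data_points)
    return f"{chart_type}: X-axis is {x_label}, Y-axis is {y_label}, {values}"
```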
[0139] Step S1.6: The set of text semantic blocks is vectorized and encoded using the Sentence-BERT pre-trained model to obtain a set of text semantic vectors.
[0140] Preferably, the core of this step is to complete the vectorization encoding of text semantic blocks. The text blocks are mapped into 384-dimensional dense vectors through the Sentence-BERT model to enable semantic similarity calculation.
[0141] For the Sentence-BERT model, a pre-trained all-MiniLM-L6-v2 model is used, with an output vector dimension of d=384, and an average pooling strategy is used during inference.
[0142] In the encoding process, the text semantic block is input into the model, and the [CLS] vector of the last hidden state is extracted as the semantic representation. The calculation formula is as follows: $v_i = \mathrm{SBERT}(t_i)$.
[0143] Here, $t_i$ is the i-th text semantic block, and $v_i$ is the corresponding semantic vector.
[0144] For batch encoding, a batch processing method is adopted to improve efficiency, with a batch size of B=32, and GPU parallel acceleration is supported.
[0145] For vector normalization, after encoding, each vector $v_i$ is L2 normalized, and the formula is as follows: $\hat{v}_i = v_i / \lVert v_i \rVert_2$.
[0146] For example, a product instruction manual is segmented into 16 text semantic blocks, which, after Sentence-BERT encoding, result in 16 384-dimensional vectors, forming a text semantic vector set.
[0147] Step S1.7: The set of table semantic blocks is vectorized and encoded using a TaBERT pre-trained model to obtain a set of table semantic vectors.
[0148] Preferably, the core of this step is to complete the vectorized encoding of the table semantic blocks, and capture the structural information and semantic content of the table through the TaBERT model to realize the vector representation of the table semantics.
[0149] For the TaBERT model, a pre-trained TaBERT-base model is used, with an output vector dimension of d=768, which is optimized for the table structure.
[0150] For the table encoding strategy, the key-value pair sequence is converted into text form and input into the model. The [CLS] vector is extracted as the semantic representation of the table, and the calculation formula is as follows: $u_j = \mathrm{TaBERT}(s_j)$.
[0151] Here, $s_j$ is the key-value pair sequence of the j-th table semantic block, and $u_j$ is the corresponding semantic vector.
[0152] Specifically, for sequence length processing, when the key-value pair sequence length exceeds 512, a truncation strategy is adopted, retaining the first 256 tokens and the last 256 tokens.
[0153] Specifically, to unify the vector dimensions, a linear transformation is used to project the 768-dimensional vector to 384 dimensions, aligning it with the text vector dimensions. The projection formula is as follows: $u'_j = W_t \, u_j$.
[0154] Here, $u_j$ is the 768-dimensional table semantic vector, $u'_j$ is its 384-dimensional projection, and $W_t \in \mathbb{R}^{384 \times 768}$ is a learnable projection matrix.
[0155] For example, a sales data table is divided into 12 semantic blocks, which are then encoded using TaBERT to obtain 12 768-dimensional vectors. After projection, these vectors are converted into 12 384-dimensional vectors, forming a table semantic vector set.
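The head-and-tail truncation and the 768-to-384 projection of step S1.7 can be illustrated without any model dependency. In this sketch the projection matrix is filled with random values as a stand-in for a learned matrix, which is an assumption; the function names are illustrative.

```python
import math
import random


def truncate_head_tail(tokens, max_len=512, head=256, tail=256):
    """Keep the first `head` and last `tail` tokens of an over-long sequence."""
    if len(tokens) <= max_len:
        return list(tokens)
    return list(tokens[:head]) + list(tokens[-tail:])


def make_projection(out_dim, in_dim, seed=0):
    """Random stand-in for a learnable out_dim x in_dim projection matrix."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Apply the linear map: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]
```

The same `project` helper would serve for a 512-to-384 projection of image vectors with a differently shaped matrix.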
[0156] Step S1.8: The image semantic block set is vectorized and encoded using the CLIP visual encoder to obtain an image semantic vector set.
[0157] Preferably, the core of this step is to complete the vectorized encoding of image semantic blocks, extract image features through CLIP visual encoder, and realize the vector representation of image semantics.
[0158] For the CLIP model, a pre-trained ViT-B/32 visual encoder is used; the input image is resized to the fixed input size of the encoder, and the output vector dimension is d=512.
[0159] Image preprocessing includes resizing, center cropping, and normalization, with the normalization applied using a per-channel mean and standard deviation.
[0160] In the encoding process, the preprocessed image is input into the visual encoder, and the vector at the last layer [CLS] position is extracted as the semantic representation of the image. The calculation formula is as follows: $z_k = \mathrm{CLIP}_{\mathrm{visual}}(I_k)$.
[0161] Here, $I_k$ is the original image corresponding to the k-th image semantic block, and $z_k$ is the corresponding semantic vector.
[0162] Specifically, for vector dimension unification, a learnable linear transformation is used to project a 512-dimensional vector to a 384-dimensional vector, achieving dimensional alignment with text and table vectors.
[0163] For example, a product parameter chart, after being encoded by the CLIP visual encoder, yields a 512-dimensional vector, which is then projected into a 384-dimensional vector, forming an image semantic vector set.
[0164] Step S1.9: Merge the text semantic vector set, the table semantic vector set, and the image semantic vector set to obtain a multi-source semantic vector set.
[0165] Preferably, the core of this step is to complete the fusion of multimodal semantic vectors, unifying the vectors of text, tables, and images into a multi-source semantic vector set.
[0166] Specifically, the merging strategy involves concatenating the three types of vector sets sequentially to form a unified multi-source semantic vector set, represented as follows: $V = V_{\mathrm{text}} \cup V_{\mathrm{table}} \cup V_{\mathrm{image}}$.
[0167] Here, $V$ is the multi-source semantic vector set, containing $N = |V_{\mathrm{text}}| + |V_{\mathrm{table}}| + |V_{\mathrm{image}}|$ vectors.
[0168] Specifically, for vector source tags, a modal label (text, table, or image) is added to each vector, which facilitates subsequent modal differentiation processing.
[0169] In terms of metadata association, each vector maintains a one-to-one correspondence with the metadata of the original semantic block, ensuring vector traceability.
[0170] For example, the text semantic vector set contains 16 vectors, the table semantic vector set contains 12 vectors, and the image semantic vector set contains 1 vector. After merging, the multi-source semantic vector set contains 29 vectors of 384 dimensions in total, and each vector is accompanied by a modality label and source metadata.
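The merge of step S1.9 can be sketched as a sequential concatenation that tags each vector with its modality and keeps its metadata alongside. The record layout (dictionary keys `vector`, `modality`, `metadata`) is an assumption made for illustration.

```python
def merge_vector_sets(text_items, table_items, image_items):
    """Concatenate three lists of (vector, metadata) pairs into one tagged list."""
    merged = []
    for modality, items in (
        ("text", text_items),
        ("table", table_items),
        ("image", image_items),
    ):
        for vector, metadata in items:
            merged.append({"vector": vector, "modality": modality, "metadata": metadata})
    return merged
```

Because the order of concatenation is fixed, the position of a record in the merged list can serve as its vector index in later steps.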
[0171] Beneficial effects of steps S1.1 to S1.9: This series of steps, through semantic unit segmentation and multimodal coding, realizes a unified semantic representation of multi-source heterogeneous data, solves the problem of cross-modal semantic gap between text, tables, and images, and lays a data foundation for subsequent hierarchical index construction and cross-modal retrieval.
[0172] The process includes the following steps: Step S1.1: Data acquisition and format normalization to eliminate differences in data sources; Step S1.2: Modal segmentation to provide a basis for differentiated processing; Step S1.3: Accurate segmentation of text semantic blocks through sliding window and semantic coherence detection to ensure semantic integrity; Step S1.4: Preservation of table structure information through key-value structured segmentation to maintain business semantic units; Step S1.5: Structured description of image semantics through visual element recognition to convert unstructured images into searchable text; Step S1.6: Text semantic vectorization through Sentence-BERT encoding; Step S1.7: Table semantic vectorization through TaBERT encoding; Step S1.8: Image semantic vectorization through CLIP encoding; and Step S1.9: Merging of the three types of semantic vectors to construct a unified multi-source semantic vector space.
[0173] Step S2 involves constructing a hierarchical semantic index tree structure from the multi-source semantic vector set using two-layer K-means clustering.
[0174] Furthermore, step S2 specifically includes the following steps.
[0175] Step S2.1: Perform L2 norm global normalization on the multi-source semantic vector set to obtain a normalized semantic vector set.
[0176] Preferably, the core of this step is to normalize the vector magnitude, eliminate the influence of vector magnitude differences on distance calculation, and provide standardized input for subsequent clustering.
[0177] For L2 norm normalization, the Euclidean norm of each vector is calculated using the following formula: $\lVert v_i \rVert_2 = \sqrt{\sum_{j=1}^{d} v_{i,j}^2}$.
[0178] Here, $v_i$ is the i-th vector, $v_{i,j}$ is the j-th dimension component of the vector, and d=384 is the vector dimension.
[0179] For the normalization operation, the vector is divided by its L2 norm to obtain a unit vector, as shown in the formula: $\hat{v}_i = v_i / \lVert v_i \rVert_2$.
[0180] After normalization, all vectors have a magnitude of 1 and are distributed on the unit hypersphere.
[0181] For numerical stability, when the vector norm is less than a small preset threshold, the vector is set directly to the zero vector to avoid a division-by-zero error.
[0182] For example, a multi-source semantic vector set contains 29 vectors, each with a dimension of 384. After L2 normalization, a normalized semantic vector set is obtained in which all vectors have a magnitude of 1.
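A minimal sketch of the L2 normalization in step S2.1, including the guard against near-zero norms. The threshold value `eps=1e-12` is an assumption, since the source does not preserve the actual value.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length; near-zero vectors become the zero vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm < eps:
        return [0.0] * len(vector)
    return [x / norm for x in vector]


def normalize_all(vectors, eps=1e-12):
    return [l2_normalize(v, eps) for v in vectors]
```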
[0183] Step S2.2: Count the total number of vectors in the normalized semantic vector set, and calculate the number of first-level cluster centers by taking the arithmetic square root of the total number of vectors.
[0184] Preferably, the core of this step is to automatically calculate the size of the first-level clusters, and determine the number of cluster centers by the arithmetic square root of the total number of vectors, thereby achieving adaptive cluster size.
[0185] Specifically, for the total number of vectors, the normalized semantic vector set is traversed, and the number of vectors in the set is calculated, denoted as N.
[0186] The formula for calculating the number of cluster centers is as follows: $K_1 = \lceil \sqrt{N} \rceil$.
[0187] Here, $K_1$ represents the number of cluster centers in the first layer, and $\lceil \cdot \rceil$ is the ceiling function.
[0188] For the range constraint on the number of cluster centers, a minimum value and a maximum value are set. When the calculated value falls outside this range, the nearest boundary value is taken.
[0189] For example, the normalized semantic vector set contains N=29 vectors, so $K_1 = \lceil \sqrt{29} \rceil = \lceil 5.39 \rceil = 6$, and the number of cluster centers in the first layer is 6.
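The cluster-count rule of step S2.2 can be sketched as below. The worked example (N=29 giving 6 centers) is consistent with rounding the square root up, so the sketch assumes ceiling rounding; the bounds `k_min` and `k_max` are hypothetical placeholders, since the source does not preserve their values.

```python
import math


def first_layer_cluster_count(n_vectors, k_min=2, k_max=1024):
    """Ceiling of sqrt(n_vectors), clamped to [k_min, k_max]."""
    root = math.isqrt(n_vectors)      # integer floor of the square root
    if root * root < n_vectors:
        root += 1                     # round up when n is not a perfect square
    return max(k_min, min(root, k_max))
```

Using `math.isqrt` keeps the computation in integers, so perfect squares are never rounded up by floating-point error.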
[0190] Step S2.3: Perform the first-level clustering on the normalized semantic vector set using the K-means clustering algorithm to obtain the first-level cluster center vector set and the first-level intra-cluster vector grouping set.
[0191] Preferably, the core of this step is to perform coarse-grained partitioning of the normalized semantic vector set, using the K-means algorithm to divide the vectors into the number of clusters determined in step S2.2 and generate the first-level clustering result.
[0192] For the initialization of the K-means algorithm, the K-means++ initialization strategy is used to select the initial cluster centers, and the formula is as follows: $P(v_i) = \dfrac{D(v_i)^2}{\sum_{j=1}^{N} D(v_j)^2}$.
[0193] Here, $P(v_i)$ is the probability of selecting vector $v_i$ as the next center, and $D(v_i)$ is the distance from vector $v_i$ to the nearest already selected center.
[0194] For the distance metric, Euclidean distance is used, and the formula is: $\mathrm{dist}(v_i, c_k) = \lVert v_i - c_k \rVert_2$.
[0195] Here, $c_k$ is the k-th cluster center vector.
[0196] For the clustering iteration process, the following steps are repeated until convergence.
[0197] ① Assignment step: assign each vector to the nearest cluster center.
[0198] ② Update step: recalculate the center vector of each cluster as the mean of the vectors assigned to it.
[0199] ③ Convergence criterion: the change in the cluster centers is less than a preset threshold, or the maximum number of iterations is reached.
[0200] The first-level clustering outputs the set of cluster center vectors $C^{(1)} = \{c_1, \dots, c_{K_1}\}$ and the set of intra-cluster vector groups $G^{(1)} = \{G_1, \dots, G_{K_1}\}$, where $K_1$ is the number of first-level cluster centers.
[0201] For example, the normalized semantic vector set contains 29 vectors, which are divided into 6 clusters after K-means clustering. The set of first-level cluster center vectors contains 6 384-dimensional vectors, and the numbers of vectors in the clusters are {5, 4, 6, 3, 7, 4} respectively.
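The clustering loop of step S2.3 can be sketched in plain Python as follows: K-means++ seeding, then alternating assignment and update steps until the centers move less than a tolerance or an iteration cap is reached. The values of `tol`, `max_iter`, and `seed`, the handling of empty clusters, and all function names are assumptions; a production system would more likely rely on an existing library implementation.

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(vectors, k, rng):
    """Pick k initial centers with probability proportional to D(v)^2."""
    centers = [list(rng.choice(vectors))]
    while len(centers) < k:
        d2 = [min(sq_dist(v, c) for c in centers) for v in vectors]
        if sum(d2) == 0:  # every vector coincides with a chosen center
            centers.append(list(rng.choice(vectors)))
        else:
            centers.append(list(rng.choices(vectors, weights=d2, k=1)[0]))
    return centers


def assign(vectors, centers):
    """Assignment step: index of the nearest center for each vector."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(v, centers[j]))
        for v in vectors
    ]


def kmeans(vectors, k, max_iter=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(vectors, k, rng)
    for _ in range(max_iter):
        labels = assign(vectors, centers)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # update step: mean of the assigned vectors
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:        # assumed: an empty cluster keeps its old center
                updated.append(centers[j])
        shift = max(math.sqrt(sq_dist(a, b)) for a, b in zip(centers, updated))
        centers = updated
        if shift < tol:
            break
    labels = assign(vectors, centers)
    groups = [[i for i, lab in enumerate(labels) if lab == j] for j in range(k)]
    return centers, groups
```

The function returns the center vectors together with groups of vector indices, which matches the two outputs named for the first-level clustering.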
[0202] Step S2.4: For each vector group set within the first layer cluster, count the number of grouped vectors, and calculate the number of second-layer cluster centers by taking the arithmetic square root of the number of grouped vectors.
[0203] Preferably, the core of this step is to automatically calculate the size of the second-level clusters and independently calculate the number of cluster centers for each first-level group, thereby achieving adaptive fine-grained clustering.
[0204] Specifically, for counting the number of grouped vectors, each first-level intra-cluster vector group $G_k$ is traversed and the number of vectors in it is counted, denoted as $N_k = |G_k|$.
[0205] The formula for calculating the number of cluster centers in the second layer is as follows: $K_2^{(k)} = \max\left(2, \lceil \sqrt{N_k} \rceil\right)$.
[0206] Here, $K_2^{(k)}$ represents the number of second-level cluster centers corresponding to the k-th group, with a minimum value of 2 to ensure that there are at least two sub-clusters.
[0207] The numbers of cluster centers for the groups are stored as an array $[K_2^{(1)}, K_2^{(2)}, \dots, K_2^{(K_1)}]$, where $K_1$ is the number of first-level clusters, which facilitates subsequent calls to the second-level clustering.
[0208] For example, the numbers of vectors in the 6 first-layer clusters are {5, 4, 6, 3, 7, 4}, and the numbers of second-layer cluster centers are calculated to be {3, 2, 3, 2, 3, 2}, for a total of 3+2+3+2+3+2=15 second-layer clusters.
[0209] Step S2.5: Perform a second-level clustering on each group set using the K-means clustering algorithm to obtain the second-level cluster center vector set and the second-level intra-cluster vector group set.
[0210] Preferably, the core of this step is to complete the fine-grained division of each first-level group, and to further divide each first-level cluster into several sub-clusters using the K-means algorithm to generate the second-level clustering results.
[0211] For the second-level clustering process, K-means clustering is performed independently on each first-level group $G_k$, with the number of cluster centers being $K_2^{(k)}$, the number of second-level cluster centers calculated for the k-th group.
[0212] Specifically, for the second-level clustering output, the set of sub-cluster center vectors $C_k^{(2)} = \{c_{k,1}, \dots, c_{k,K_2^{(k)}}\}$ and the set of vector groups within sub-clusters $G_k^{(2)} = \{G_{k,1}, \dots, G_{k,K_2^{(k)}}\}$ corresponding to each group are obtained.
[0213] For the global index of the second-level cluster centers, all second-level cluster centers are numbered sequentially, and the global index formula is: $\mathrm{idx}(k, m) = \sum_{j=1}^{k-1} K_2^{(j)} + m$.
[0214] Here, (k,m) represents the m-th sub-cluster of the second layer under the k-th first-layer cluster.
[0215] Specifically, for the second-level clustering output summary, the set of all second-level cluster center vectors $C^{(2)} = \bigcup_k C_k^{(2)}$ and the second-layer intra-cluster vector grouping set $G^{(2)} = \bigcup_k G_k^{(2)}$ are summarized, where $C^{(2)}$ contains $\sum_k K_2^{(k)}$ center vectors.
[0216] For example, the first cluster in the first layer contains 5 vectors, which are divided into 3 sub-clusters after K-means clustering. The set of sub-cluster center vectors contains three 384-dimensional vectors, and the numbers of vectors within the sub-clusters are {2, 1, 2}.
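The per-group cluster counts of step S2.4 and the sequential numbering of second-level centers in step S2.5 can be sketched together. The worked example is consistent with rounding the square root up, so ceiling rounding is assumed; the global numbering rule (offset by the counts of all earlier groups, 1-based) is likewise an assumption, as the original formula is not preserved.

```python
import math


def ceil_sqrt(n):
    root = math.isqrt(n)
    return root if root * root == n else root + 1


def second_layer_counts(group_sizes, k_min=2):
    """Number of sub-clusters per first-level group, at least k_min each."""
    return [max(k_min, ceil_sqrt(size)) for size in group_sizes]


def global_index(k, m, counts):
    """Map the 1-based pair (k, m) to a 1-based index over all sub-clusters."""
    return sum(counts[: k - 1]) + m
```

For the group sizes {5, 4, 6, 3, 7, 4} from the example, `second_layer_counts` reproduces {3, 2, 3, 2, 3, 2}, and the last sub-cluster of the sixth group receives global index 15.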
[0217] Step S2.6: Perform hierarchical association binding between the first layer cluster center vector set and the second layer cluster center vector set to obtain a hierarchical cluster center association structure.
[0218] Preferably, the core of this step is to complete the hierarchical relationship construction of the two-layer cluster centers, establish the association relationship between parent and child nodes, and form a tree-like structural skeleton.
[0219] For hierarchical relationships, a tree data structure is used, with each first-level cluster center serving as a parent node and each of its subordinate second-layer cluster centers serving as a child node.
[0220] For storing relationships, an adjacency list structure is used, for example $\{c_k: [c_{k,1}, c_{k,2}, \dots]\}$.
[0221] Here, $c_k$ is the k-th first-layer cluster center, and $c_{k,m}$ is the m-th second-layer cluster center under it.
[0222] Specifically, for node attribute storage, each cluster center node stores the cluster ID, vector representation, list of subordinate sub-clusters, and the ID of its parent cluster (only for second-level nodes).
[0223] For example, the first first-level cluster center $c_1$ is associated with 3 second-level sub-cluster centers $c_{1,1}$, $c_{1,2}$, and $c_{1,3}$, forming a hierarchical relationship of four nodes: one parent and three children.
[0224] Step S2.7: Bind each vector in the multi-source semantic vector set to the node pointer of the corresponding second-layer cluster vector group to obtain a hierarchical tree node set with vector indexes.
[0225] Preferably, the core of this step is to complete the mapping between the original vector and the tree node, associating each vector in the multi-source semantic vector set with the corresponding second-level sub-cluster node, thereby realizing the indexing of vectors to tree nodes.
[0226] For vector-node binding, each vector in the normalized semantic vector set is traversed and, based on the second-level sub-cluster group to which it belongs, a pointer mapping relationship is established.
[0227] For the pointer storage structure, a dictionary structure is used, where the key is the vector index i and the value is a node pointer, for example: $\{i: \mathrm{node}_{k,m}\}$.
[0228] Here, $\mathrm{node}_{k,m}$ represents the m-th sub-cluster node of the k-th first-layer cluster.
[0229] For inverted indices, a reverse mapping from nodes to vectors is also established, for example $\{\mathrm{node}_{k,m}: [i_1, i_2, \dots, i_n]\}$.
[0230] Here, $i_1, i_2, \dots, i_n$ are the vector indices belonging to this node.
[0231] Specifically, for vector metadata inheritance, the original metadata of the vector (source file name, modal label, etc.) is copied into the tree node attributes to achieve hierarchical inheritance of metadata.
[0232] For example, the normalized semantic vector with index 3 belongs to sub-cluster 2 of first-level cluster 1, so the pointer $3 \to \mathrm{node}_{1,2}$ is established; meanwhile, the inverted index entry for $\mathrm{node}_{1,2}$ includes the vector index [3].
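The forward pointer dictionary and the inverted index of step S2.7 can be sketched with two plain dictionaries. Representing a node pointer as a `(k, m)` tuple is an assumption made to keep the example self-contained.

```python
from collections import defaultdict


def build_indexes(assignments):
    """Build both mappings from {vector_index: (k, m)} assignments.

    Returns (pointer, inverted): pointer maps a vector index to its node,
    inverted maps a node to the sorted list of vector indices it holds.
    """
    pointer = dict(assignments)
    inverted = defaultdict(list)
    for vector_index in sorted(pointer):
        inverted[pointer[vector_index]].append(vector_index)
    return pointer, dict(inverted)
```

A lookup in `pointer` answers which node holds a given vector, while `inverted` answers which vectors to scan once a node has been selected during retrieval.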
[0233] Step S2.8: Perform topological sorting and structural solidification on the hierarchical tree node set to obtain the hierarchical semantic index tree structure.
[0234] Preferably, the core of this step is to complete the topological organization and persistent storage of the hierarchical tree, determine the node order through topological sorting, and generate a storable index file through structural solidification.
[0235] For topological sorting, a breadth-first search (BFS) strategy is adopted, starting from the root node and traversing level by level to output the node sequence, ensuring that the parent node is before the child node.
[0236] For the tree structure representation, JSON format is used for storage.
[0237] Each node in it includes the node attributes and a list of child nodes.
[0238] For structure solidification, the topologically sorted tree structure is serialized and stored as a binary file, with the file name format being index_tree_yyyyMMDD HHmmss.bin, and the storage path being the specified index directory.
[0239] Specifically, the index tree metadata records the construction timestamp, the total number of vectors N, the number of first-level clusters, the total number of second-level clusters, and the vector dimension d.
[0240] For example, a hierarchical semantic index tree contains 6 first-level nodes and 15 second-level nodes. After topological sorting, it outputs a sequence of 21 nodes, which is then serialized to generate an index file.
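The breadth-first ordering and serialization of step S2.8 can be sketched as follows. The node layout (`id` and `children` keys), the virtual root that holds the first-level nodes, and the use of UTF-8 encoded JSON bytes as a stand-in for the binary index file are all assumptions for illustration.

```python
import json
from collections import deque


def build_tree(second_layer_counts):
    """Build a two-level tree under a virtual root from per-group counts."""
    first_level = []
    for k, n_sub in enumerate(second_layer_counts, start=1):
        children = [{"id": f"L2-{k}-{m}", "children": []} for m in range(1, n_sub + 1)]
        first_level.append({"id": f"L1-{k}", "children": children})
    return {"id": "root", "children": first_level}


def bfs_order(tree):
    """Level-by-level node ids, so every parent precedes its children."""
    order, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        order.append(node["id"])
        queue.extend(node["children"])
    return order


def serialize(tree):
    """Serialize the tree to bytes that could be written to an index file."""
    return json.dumps(tree, sort_keys=True).encode("utf-8")
```

Excluding the virtual root, the traversal of a tree with 6 first-level nodes and 15 second-level nodes yields the 21-node sequence described in the example.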
[0241] The beneficial effects of steps S2.1 to S2.8: This series of steps constructs a hierarchical semantic index tree through two-layer K-means clustering, realizing efficient organization and fast retrieval of large-scale semantic vectors, reducing the retrieval time complexity from O(N) to O(logN).
[0242] The process involves the following steps: Step S2.1 performs L2 norm normalization to eliminate the influence of vector magnitude on distance calculation; Step S2.2 determines the number of cluster centers in the first layer by taking the square root of the total number of vectors, achieving automatic adaptation of the cluster size; Step S2.3 completes the first layer of K-means clustering to generate coarse-grained cluster partitions; Step S2.4 determines the number of cluster centers in the second layer by taking the square root of the number of grouping vectors, achieving adaptive fine-grained cluster size; Step S2.5 completes the second layer of K-means clustering to generate fine-grained cluster partitions; Step S2.6 completes hierarchical association binding to establish the parent-child relationship between the two layers of cluster centers; Step S2.7 completes vector pointer binding to establish the mapping between nodes and original semantic blocks; and Step S2.8 completes topological sorting and structure solidification to generate a persistently stored hierarchical semantic index tree structure.
[0243] Step S3: In response to external user queries, the hierarchical semantic index tree structure is combined with the user query, and the RoBERTa model is used for intent recognition and entity extraction to obtain structured query triples.
[0244] Furthermore, step S3 specifically includes the following steps.
[0245] Step S3.1: Perform text standardization and noise character filtering on the user's question text to obtain the standardized user question text.
[0246] Preferably, the core of this step is to preprocess the user-input text, eliminate text format differences through standardization operations, and improve the quality of subsequent encoding through noise filtering.
[0247] The text standardization operations include: converting full-width characters to half-width characters, converting traditional Chinese characters to simplified Chinese characters, compressing consecutive spaces into single spaces, decoding HTML entities, and uniformly encoding URLs.
[0248] Among them, the filtering of noisy characters includes: removing emojis, removing control characters, removing invisible characters, and removing duplicate punctuation marks.
[0249] Specifically, for text encoding uniformity, the input text encoding is uniformly converted to UTF-8 to ensure encoding consistency in subsequent processing.
[0250] For example, the original user question "Where is the PDF file??" is transformed into "Where is the PDF file?" after text normalization. Emoticons, control characters, and extra spaces are removed, resulting in the standardized user question text.
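The standardization and noise filtering of step S3.1 can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the applicant's implementation: the function name `standardize_question` is hypothetical, traditional-to-simplified Chinese conversion and URL handling are omitted because they need external resources, and treating Unicode category "So" as "emoji" is a coarse assumption that also removes a few other symbols.

```python
import html
import re
import unicodedata

def standardize_question(text: str) -> str:
    """Rough sketch of step S3.1 (hypothetical helper, stdlib only)."""
    text = html.unescape(text)                   # decode HTML entities
    text = unicodedata.normalize("NFKC", text)   # full-width -> half-width
    text = re.sub(r"\s+", " ", text)             # whitespace runs -> one space
    # Drop control (Cc), invisible/format (Cf) and symbol (So, most emoji) characters.
    text = "".join(ch for ch in text
                   if unicodedata.category(ch) not in {"Cc", "Cf", "So"})
    text = re.sub(r" {2,}", " ", text).strip()   # gaps left by removed characters
    text = re.sub(r"([?!,;:。])\1+", r"\1", text)  # collapse repeated punctuation
    return text

print(standardize_question("Where  is the PDF file??"))
```

Because Python strings are Unicode, the UTF-8 unification mentioned in the text would happen when the input bytes are decoded, before a helper like this is called.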
[0251] Step S3.2: Semantically encode the standardized user question text using the RoBERTa pre-trained model to obtain the user question semantic vector.
[0252] Preferably, the core of this step is to complete the vectorized representation of the user's question, extract the deep semantic features of the text through a pre-trained language model, and generate a dense vector that can be used for similarity calculation.
[0253] For the RoBERTa model, a pre-trained Roberta-base-Chinese model is used, with a hidden layer dimension of 768 and a vocabulary size of approximately 50,000, and it supports Chinese text encoding.
[0254] In the encoding process, the standardized user question text is input into the model, and the last-layer hidden state at the [CLS] position is extracted as the semantic representation, with the formula as follows: $h_q = \mathrm{RoBERTa}(q)_{[\mathrm{CLS}]}$.
[0255] Here, $q$ is the standardized user question text and $h_q \in \mathbb{R}^{768}$ is the semantic vector of the user's question, with a dimension of 768.
[0256] Specifically, for vector dimensionality reduction, the 768-dimensional vector is reduced to 384 dimensions through linear projection, aligning with the dimension of the multi-source semantic vectors. The formula is as follows: $v_q = W_p h_q$.
[0257] Here, $W_p \in \mathbb{R}^{384 \times 768}$ is the projection matrix and $v_q$ is the dimension-reduced user query vector.
[0258] For L2 normalization, the dimension-reduced vector is L2-normalized, with the formula as follows: $\hat{v}_q = v_q / \lVert v_q \rVert_2$.
[0259] For example, a user's question "What is the sales volume of product A?" is encoded using RoBERTa to obtain a 768-dimensional vector. After projecting this vector to 384 dimensions and normalizing it, the semantic vector of the user's question is obtained.
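The projection and normalization in step S3.2 amount to one matrix-vector product followed by a division by the Euclidean norm. The sketch below illustrates only that arithmetic: the 768-dimensional input is a random stand-in for a real RoBERTa [CLS] vector, the projection matrix is an untrained random placeholder, and the names `project` and `l2_normalize` are hypothetical.

```python
import math
import random

def project(vec, matrix):
    """Linear projection: each row of `matrix` produces one output coordinate."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def l2_normalize(vec):
    """Divide by the Euclidean norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

random.seed(0)
h_q = [random.gauss(0, 1) for _ in range(768)]      # stand-in for the [CLS] vector
W_p = [[random.gauss(0, 0.02) for _ in range(768)]  # placeholder 384 x 768 matrix
       for _ in range(384)]
v_q = l2_normalize(project(h_q, W_p))
print(len(v_q))
```

In a trained system the projection matrix would be learned so that the 384-dimensional query vector lands in the same space as the multi-source semantic vectors.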
[0260] Step S3.3: Perform layer-by-layer cosine similarity matching between the user's query semantic vector and the hierarchical semantic index tree to obtain a Top-K candidate semantic block set.
[0261] Preferably, the core of this step is to quickly retrieve candidate semantic blocks. Through layer-by-layer matching of the hierarchical index tree, the K semantic blocks that are most similar to the user's question semantics are located in O(logN) complexity.
[0262] For the first layer of matching, the cosine similarity between the user's query vector and every first-layer cluster center is calculated using the following formula.
[0263] $\mathrm{sim}(\hat{v}_q, c_i) = \dfrac{\hat{v}_q \cdot c_i}{\lVert \hat{v}_q \rVert \, \lVert c_i \rVert}$, where $\hat{v}_q$ is the user's query vector and $c_i$ is the i-th first-layer cluster center.
[0264] The first-layer clusters with the highest similarity, up to the first-level candidate number, are selected as candidate parent clusters.
[0265] For the second layer of matching, for each candidate parent cluster, the cosine similarity between the user's query vector and the centers of its subordinate sub-clusters is calculated, and the sub-clusters with the highest similarity, up to the second-level candidate number, are selected as candidate sub-clusters.
[0266] Specifically, for candidate semantic block extraction, the original semantic blocks associated with candidate sub-cluster nodes are extracted and aggregated to form a Top-K candidate semantic block set.
[0267] Specifically, for the hierarchical matching parameters, the first-level candidate number is set to 3 and the second-level candidate number to 5, so the final number of candidates is K=15.
[0268] For example, the similarity between the user's query vector and the centers of the six first-layer clusters is calculated, and the top three clusters are numbered {2, 1, 4}. The similarity between the query vector and the sub-cluster centers of these three clusters is then calculated, and the top five sub-clusters of each cluster are selected, ultimately resulting in a Top-K set of 15 candidate semantic blocks.
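The layer-by-layer matching of step S3.3 can be sketched as two nested top-k selections. The nested-dictionary tree layout, the function names, and the defaults `k1=3`, `k2=5` (taken from the example above) are illustrative assumptions. All vectors are assumed to be L2-normalized already, so the dot product equals the cosine similarity.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, nodes, k):
    """Return the k nodes whose "center" vector is most similar to the query."""
    return sorted(nodes, key=lambda n: dot(query, n["center"]), reverse=True)[:k]

def hierarchical_search(query, tree, k1=3, k2=5):
    """tree: list of {"center": vec, "children": [{"center": vec, "blocks": [...]}]}."""
    candidates = []
    for parent in top_k(query, tree, k1):                  # first-layer matching
        for child in top_k(query, parent["children"], k2): # second-layer matching
            candidates.extend(child["blocks"])             # pointer-bound blocks
    return candidates
```

Only the centers of the selected parent clusters are compared in the second layer, which is where the saving over a flat scan of all N vectors comes from.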
[0269] Step S3.4: Predict the intent category label of the standardized user question text using the RoBERTa sequence labeling branch to obtain the user question intent label.
[0270] Preferably, the core of this step is to classify the user's question intent, predict the intent category through a sequence labeling model, and achieve automatic identification of user needs.
[0271] For the intent category system, the predefined intent categories include: query (asking for data, searching for documents), statistics (summing, averaging, sorting), comparison (difference analysis, trend comparison), and explanation (requesting explanation, requesting description).
[0272] For the sequence labeling model, a fully connected classification layer is added after the RoBERTa encoder to output the intent category probability distribution, as shown in the formula: $p = \mathrm{softmax}(W_c h_{[\mathrm{CLS}]} + b_c)$.
[0273] Here, $h_{[\mathrm{CLS}]}$ is the hidden state at position [CLS], $W_c$ and $b_c$ are the weight and bias of the classification layer, and $p$ is the intent probability vector.
[0274] For intent label prediction, the category with the highest probability is selected as the predicted intent, and the formula is as follows: $\hat{y} = \arg\max_k p_k$.
[0275] Here, $\hat{y}$ is the predicted intent label.
[0276] For example, if a user asks "What is the sales volume of product A?", the probability distribution after intent classification is {Query: 0.85, Statistics: 0.10, Comparison: 0.03, Explanation: 0.02}, and the predicted intent label is "Query".
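The softmax and arg-max selection of step S3.4 can be illustrated on a vector of classification-layer outputs. The logits passed in the usage line are made-up numbers rather than outputs of a trained model, and the label list and function names are hypothetical.

```python
import math

INTENTS = ["Query", "Statistics", "Comparison", "Explanation"]

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_intent(logits):
    """Return the most probable intent label and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return INTENTS[best], probs

label, probs = predict_intent([2.0, 0.0, -1.0, -1.5])  # illustrative logits
print(label)
```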
[0277] Step S3.5: Perform named entity boundary detection and type labeling on the standardized user question text using the RoBERTa sequence labeling branch to obtain the user question entity set.
[0278] Preferably, the core of this step is to identify key entities in the user's query, detect entity boundaries and label entity types through a sequence labeling model, and extract key elements of the query.
[0279] For the entity type system, the predefined entity types include: product name (e.g., "product A"), indicator name (e.g., "sales amount"), time range (e.g., "YYYY year"), region (e.g., "a certain region"), and department (e.g., "sales department").
[0280] For the sequence labeling architecture, the BIO labeling system is adopted, where B-X represents the starting position of entity X, I-X represents an internal position of entity X, and O represents a non-entity position.
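[0281] For entity boundary detection, the output probability for each token is given by the formula: $p_i = \mathrm{softmax}(W_e h_i + b_e)$.
[0282] Here, $h_i$ is the hidden state of the i-th token, $W_e$ and $b_e$ are the weight and bias of the labeling layer, and $p_i$ is the label probability vector.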
[0283] For entity sequence decoding, the Viterbi algorithm is used to solve for the optimal annotation sequence to obtain the entity boundary and type.
[0284] For example, if a user asks "What is the sales revenue of product A in year YYYY", after entity recognition, the labeled sequence is "B-product name I-product name O B-time range I-time range O B-metric name I-metric name O O O", and the entity set E is extracted as {product name: product A, time range: year YYYY, metric name: sales revenue}.
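Once a tag sequence has been decoded, turning BIO tags into an entity set is a single pass over the tokens. The sketch below covers only that grouping step and does not implement the Viterbi decoding itself. The function name `decode_bio` is hypothetical, and in the usage example the English word-level tokens stand in for the original Chinese tokens.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:  # close the entity that was open
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open entity
            if current_type:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["product", "A", "in", "year", "YYYY", "of", "sales", "revenue", "is", "what", "?"]
tags = ["B-product name", "I-product name", "O", "B-time range", "I-time range", "O",
        "B-metric name", "I-metric name", "O", "O", "O"]
print(decode_bio(tokens, tags))
```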
[0285] Step S3.6: Perform field alignment mapping between the user's question intent tag and the user's question entity set to obtain a set of intent-entity association pairs.
[0286] Preferably, the core of this step is to complete the structured association between intents and entities, organizing independent intent tags and entity sets into intent-entity association pairs, providing an intermediate representation for subsequent triple generation.
[0287] For field alignment rules, the entity slot mapping relationship is determined based on the intent type. For example, query intents are associated with slots for (main entity, query metric, time range); statistical intents are associated with slots for (statistical object, statistical metric, time range).
[0288] Specifically, for generating association pairs, entities from the entity set are filled into the corresponding positions according to the slot mapping relationship, for example $P = (\text{intent}, \{e_1, e_2, \ldots, e_k\})$.
[0289] Here, $P$ represents an intent-entity association pair and $e_i$ is the entity value of the i-th slot.
[0290] Specifically, for handling missing slots, when a slot has no corresponding entity, it is filled with a default value or marked as NULL.
[0291] For example, the intent tag "Query Class" corresponds to the slots (subject, metric, time). With the entity set {product name: product A, metric name: sales amount, time range: YYYY year}, field alignment yields the association pair P=(Query Class, {product A, sales amount, YYYY year}).
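The field alignment of step S3.6 can be sketched as a lookup from intent type to slot list, followed by filling each slot from the entity set. The slot names, the mapping from slots to entity types, and the use of `None` as the NULL marker are illustrative assumptions.

```python
# Hypothetical slot mapping: intent -> {slot name: entity type that fills it}
SLOT_MAP = {
    "Query": {"subject": "product name", "metric": "metric name", "time": "time range"},
    "Statistics": {"object": "product name", "metric": "metric name", "time": "time range"},
}

def align_fields(intent, entities):
    """Build an intent-entity association pair; missing slots become None (NULL)."""
    slots = SLOT_MAP.get(intent, {})
    return intent, {slot: entities.get(etype) for slot, etype in slots.items()}

pair = align_fields("Query", {"product name": "product A",
                              "metric name": "sales amount",
                              "time range": "YYYY year"})
print(pair)
```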
[0292] Step S3.7: Perform semantic association verification between the intent-entity association pair set and the Top-K candidate semantic block set to obtain the verified intent-entity association pair set.
[0293] Preferably, the core of this step is to verify the semantic consistency between the association pair and the candidate semantic block. By calculating the semantic similarity between the association pair and the candidate semantic block, irrelevant or low-relevance association pairs are filtered out to ensure the accuracy of the extraction results.
[0294] Specifically, for semantic association calculation, the intent-entity association pair is converted into a natural language description, such as "the user wants to query the information of {subject}'s {indicator} at {time}", and semantic similarity is calculated with the Top-K candidate semantic blocks.
[0295] For similarity threshold verification, a similarity threshold is set. When the highest similarity between the associated pair and all candidate semantic blocks is lower than the threshold, it is judged as a misidentification and is removed.
[0296] Specifically, for the generation of the verified set, the verified intent-entity association pairs are retained to form the verified association pair set.
[0297] For example, the association pair P=(query class, {product A, sales amount, YYYY year}) is converted into the text "The user wants to query the sales amount of product A in the year YYYY". The similarity is calculated with the Top-K candidate semantic blocks. The highest similarity is 0.82, which is higher than the threshold of 0.5, so the association pair is retained.
[0298] Step S3.8: The verified intent-entity association set is restructured in a structured manner to obtain a structured query triplet.
[0299] Preferably, the core of this step is to convert the association pairs into the standard triple format, reorganizing each intent-entity association pair into an (entity, intent, attribute) triple and providing structured input for subsequent prompt word generation.
[0300] The triple format is defined as follows: $T = (e, y, A)$.
[0301] Here, $e$ is the main query entity, $y$ is the intent type, and $A$ is the set of query attributes.
[0302] Specifically, for the construction of attribute sets, auxiliary entities such as indicators, time ranges, and regions in the association pairs are integrated into the attribute set, for example A = {indicator: sales amount, time: year YYYY}.
[0303] In the multi-entity query processing, when there are multiple subject entities, multiple triples are generated, and each triple corresponds to a subject entity.
[0304] For example, the verified association pair P=(Query type, {Product A, Sales amount, Year YYYY}) is structurally recombined into the triple T=(Product A, Query, {Indicator: Sales amount, Time: Year YYYY}), which serves as the structured input for subsequent prompt word generation.
[0305] Beneficial effects of steps S3.1 to S3.8: This series of steps, through intent recognition and entity extraction, realizes the transformation of users' natural language questions into structured queries, accurately understands user needs, and provides accurate semantic expression for subsequent prompt word generation.
[0306] The process includes the following steps: Step S3.1: Text standardization and noise filtering to improve subsequent encoding quality; Step S3.2: Generating user query semantic vectors through RoBERTa encoding to provide vector representation for similarity matching; Step S3.3: Rapidly locating Top-K candidate semantic blocks through layer-wise cosine similarity matching to provide relevance for intent verification; Step S3.4: Predicting intent category labels through the RoBERTa sequence labeling branch to achieve automatic classification of user needs; Step S3.5: Detecting named entity boundaries through the RoBERTa entity recognition branch to extract key query elements; Step S3.6: Completing field alignment mapping between intent labels and entity sets to generate intent-entity association pairs; Step S3.7: Ensuring the accuracy and relevance of the extracted results through semantic association verification and filtering out misidentified results; and Step S3.8: Completing structured recombination to generate (entity, intent, attribute) triples to provide structured input for prompt word generation.
[0307] Step S4: Fill the structured query triples into the multi-type prompt templates to generate candidate prompt words, and perform contrastive learning encoding through SimCSE-RoBERTa to obtain the candidate prompt word encoding set.
[0308] Furthermore, step S4 specifically includes the following steps.
[0309] Step S4.1: Split the structured query triples into entity fields, intent fields, and attribute fields.
[0310] Preferably, the core of this step is to complete the field parsing of the triples, splitting the structured triples into independent field components to provide input for subsequent template filling.
[0311] For the field splitting rules, the entity field $e$, the intent field $y$, and the attribute field $A$ are extracted from the triple $T = (e, y, A)$.
[0312] Specifically, expanding the attribute field means that each attribute item (indicator, time, region, etc.) in the attribute set $A$ is extracted as an independent variable, denoted as $a_1, a_2, \ldots, a_m$, where m is the number of attribute items.
[0313] For example, the triple T = (Product A, Query, {Indicator: Sales Revenue, Time: Year YYYY}) is split to obtain the entity field $e$ = "Product A" and the intent field $y$ = "Query", and the attribute field is expanded to $a_1$ = "Sales Revenue" and $a_2$ = "YYYY year".
[0314] Step S4.2: Fill the entity field, intent field, and attribute field into the predefined multi-type prompt template to obtain the initial candidate prompt word set.
[0315] Preferably, the core of this step is to generate candidate prompt words using templates. By using predefined multi-type prompt templates, structured fields are converted into candidate prompt words in natural language form.
[0316] For the prompt template types, three types of templates are predefined: text templates (suitable for text data source queries), table templates (suitable for table data source queries), and image templates (suitable for chart data source queries).
[0317] For the template filling rules, a text-type template is, for example, "Please find the {attribute} information about {entity} from text data"; a table-type template is, for example, "Please retrieve the {attribute} value of {entity} from table data"; and an image-type template is, for example, "Please extract the {attribute} trend of {entity} from chart data".
[0318] Specifically, for candidate prompt word generation, field values are filled into the template placeholders to obtain multiple candidate prompt words, which constitute the initial candidate prompt word set.
[0319] For example, after the entity field "Product A" and the attribute fields "Sales Revenue, Year YYYY" are filled into a table template, the candidate prompt "Please retrieve the sales revenue value of Product A in the year YYYY from the table data" is generated. The initial candidate prompt word set includes the candidate prompts generated from all three types of templates.
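Template filling in step S4.2 maps naturally onto string formatting. The sketch below reuses the three template sentences quoted in the text. The function name and the choice to join several attribute items with a space are assumptions, since the text does not say how multiple attributes are merged into one placeholder.

```python
TEMPLATES = {
    "text": "Please find the {attribute} information about {entity} from text data",
    "table": "Please retrieve the {attribute} value of {entity} from table data",
    "image": "Please extract the {attribute} trend of {entity} from chart data",
}

def fill_templates(entity, attributes):
    """Return one candidate prompt per template type."""
    attribute = " ".join(attributes)  # assumed way to merge attribute items
    return {kind: tpl.format(entity=entity, attribute=attribute)
            for kind, tpl in TEMPLATES.items()}

prompts = fill_templates("Product A", ["Year YYYY", "sales revenue"])
print(prompts["table"])
```

Keeping the template type as the dictionary key preserves the type tag that the later binding step attaches to each candidate prompt word.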
[0320] Step S4.3: Perform text standardization and special character filtering on the initial candidate prompt word set to obtain a standardized candidate prompt word set.
[0321] Preferably, the core of this step is to complete the text preprocessing of candidate prompt words, eliminate format differences through standardization operations, and improve the quality of subsequent encoding through special character filtering.
[0322] The text standardization operations include: converting full-width characters to half-width characters, compressing consecutive spaces into single spaces, removing leading and trailing whitespace characters, and unifying punctuation marks to Chinese punctuation.
[0323] Among them, the filtering of special characters includes: removing HTML tags, removing Markdown syntax symbols, removing invisible Unicode characters, and removing abnormally encoded characters.
[0324] For example, the initial candidate prompt "Please retrieve the sales figures for product A in year YYYY" is standardized by removing extra spaces and abnormal characters, resulting in the standardized candidate prompt word set.
[0325] Step S4.4: Initial semantic encoding of the standardized candidate prompt word set is performed using the SimCSE-RoBERTa pre-trained model to obtain an initial prompt word vector set.
[0326] Preferably, the core of this step is to complete the vectorized representation of the candidate prompt words, extracting their semantic features through the SimCSE pre-trained model and generating initial vectors that can be used for contrastive learning optimization.
[0327] For the SimCSE-RoBERTa model, a pre-trained simcse-roberta-base model is used, with a hidden layer dimension and an output vector dimension of 768.
[0328] In the encoding process, each standardized candidate prompt word is input into the model, and the hidden state at the [CLS] position is extracted as the prompt word vector, as shown in the formula: $u_i = \mathrm{SimCSE}(p_i)_{[\mathrm{CLS}]}$.
[0329] Here, $p_i$ is the i-th standardized candidate prompt word and $u_i$ is the corresponding initial prompt word vector.
[0330] For batch encoding, batch processing is used to improve efficiency, with a batch size of 16 and support for GPU parallel acceleration.
[0331] For example, a standardized candidate prompt word set containing three prompt words is encoded by SimCSE-RoBERTa to obtain an initial prompt word vector set of three vectors, each with a dimension of 768.
[0332] Step S4.5: Construct semantically similar positive sample pairs and a global negative sample queue to obtain a set of contrastive learning samples.
[0333] Preferably, the core of this step is to construct the contrastive learning training data: positive sample pairs pull semantically similar prompt word vectors closer together, and the global negative sample queue pushes semantically dissimilar vectors apart.
[0334] For constructing positive sample pairs, the user query vector $v_q$ is paired with a candidate prompt word vector $u_i$ when the candidate prompt word is highly semantically relevant to the user's question, for example $(v_q, u_i, y=1)$.
[0335] Where y=1 represents the positive sample label.
[0336] Specifically, for the construction of the negative sample queue, a global negative sample queue $Q$ is maintained to store the prompt word vectors from historical batches, with a queue capacity of M=65536. When the prompt word vector of a positive sample pair is not semantically similar to a vector in the queue, that vector in the queue is used as a negative sample.
[0337] For negative sample labels, a prompt word vector $u_j^-$ that comes from the negative sample queue and is irrelevant to the user's query vector $v_q$ forms a negative sample, for example $(v_q, u_j^-, y=0)$.
[0338] Where y=0 represents the negative sample label.
[0339] For example, the user question vector $v_q$ and a candidate prompt word vector $u_i$ form a positive sample pair, and the historical prompt word vectors stored in the negative sample queue $Q$ constitute the negative samples.
[0340] Step S4.6: Perform vector space alignment optimization using the InfoNCE loss layer of SimCSE-RoBERTa to obtain the optimized set of prompt word vectors.
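[0341] Preferably, the core of this step is to complete the contrastive learning optimization of the prompt word vectors, achieving semantic alignment of the vector space by using the InfoNCE loss function to pull positive sample pairs closer and push negative sample pairs apart.
[0342] For the InfoNCE loss function, the formula is: $\mathcal{L} = -\log \dfrac{\exp(\mathrm{sim}(v_q, u^+)/\tau)}{\exp(\mathrm{sim}(v_q, u^+)/\tau) + \sum_{j=1}^{M} \exp(\mathrm{sim}(v_q, u_j^-)/\tau)}$.
[0343] Here, $v_q$ is the user query vector, $u^+$ is the prompt word vector of the positive sample pair, $u_j^-$ is the j-th negative sample vector, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is the temperature hyperparameter, and $M$ is the number of negative samples.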
[0344] Specifically, for gradient backpropagation, the gradient is calculated based on the InfoNCE loss, and the parameters of SimCSE-RoBERTa are updated to increase the similarity of positive sample pairs and decrease the similarity of negative sample pairs.
[0345] For optimization iterations, the Adam optimizer is used with a preset learning rate and weight decay, and training runs for 10 rounds.
[0346] For example, after optimization the similarity of the positive sample pairs increases from the initial 0.72 to 0.85 and the similarity of the negative sample pairs decreases from 0.45 to 0.20, thus achieving semantic alignment in the vector space.
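To make the pull and push effect of the InfoNCE objective concrete, the sketch below evaluates the standard InfoNCE loss for one query, one positive, and a list of negatives in plain Python. It computes the loss value only (no gradient or parameter update). The temperature `tau=0.05` is an assumed value commonly used with SimCSE, not a figure taken from this document, and the vectors in the usage lines are toy 2-dimensional examples.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(query, positive, negatives, tau=0.05):
    """InfoNCE loss for one positive pair against a list of negative vectors."""
    pos = math.exp(cosine(query, positive) / tau)
    neg = sum(math.exp(cosine(query, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

q = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce(q, [0.9, 0.1], negatives)     # positive close to the query
misaligned = info_nce(q, [0.1, 0.9], negatives)  # positive far from the query
print(aligned < misaligned)
```

The loss is smaller when the positive is closer to the query than the negatives are, which is the behaviour that gradient descent on this objective reinforces.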
[0347] Step S4.7: Perform L2 norm global normalization on the optimized prompt word vector set to obtain a normalized prompt word vector set.
[0348] Preferably, the core of this step is to normalize the magnitude of the optimized vectors, unifying the magnitude of all prompt word vectors to 1, thus providing standardized input for subsequent similarity calculations.
[0349] For L2 norm normalization, the Euclidean norm of each optimized vector is calculated and the vector is divided by it, as shown in the formula: $\hat{u}_i = u_i' / \lVert u_i' \rVert_2$.
[0350] Here, $u_i'$ is the optimized i-th prompt word vector and $\hat{u}_i$ is the normalized prompt word vector.
[0351] For numerical stability, when the vector norm is less than a small threshold $\epsilon$, the vector is set directly to the zero vector to avoid a division-by-zero error.
[0352] For example, the optimized prompt word vector set $\{u_1', u_2', u_3'\}$ is L2-normalized to obtain the normalized prompt word vector set $\{\hat{u}_1, \hat{u}_2, \hat{u}_3\}$, in which all vectors have a magnitude of 1.
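The normalization with its small-norm guard can be written in a few lines. The threshold `eps=1e-12` is an assumed placeholder, since the actual threshold value is not recoverable from the text, and the function name is hypothetical.

```python
import math

def l2_normalize_all(vectors, eps=1e-12):
    """Normalize each vector to unit length; near-zero vectors become zero vectors."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm < eps:
            normalized.append([0.0] * len(vec))  # avoid division by (almost) zero
        else:
            normalized.append([x / norm for x in vec])
    return normalized

print(l2_normalize_all([[3.0, 4.0], [0.0, 0.0]]))
```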
[0353] Step S4.8: Bind the normalized prompt word vector set to the standardized candidate prompt word set one by one to obtain the candidate prompt word encoding set.
[0354] Preferably, the core of this step is to complete the association and storage of vectors and text, pairing and binding the normalized prompt word vectors with the corresponding standardized text to form a searchable set of candidate prompt word codes.
[0355] For the binding structure, a dictionary structure is used for storage, for example $c_i = \{\hat{u}_i, p_i, t_i\}$.
[0356] Here, $c_i$ is the encoding of the i-th candidate prompt word, $\hat{u}_i$ is the normalized vector, $p_i$ is the standardized text, and $t_i$ is the template type tag.
[0357] For the set representation, the candidate prompt word encoding set is represented as follows: $C = \{c_1, c_2, \ldots, c_n\}$.
[0358] Where n is the number of candidate prompt words.
[0359] For example, the normalized prompt word vector $\hat{u}_1$ is bound to the standardized text "Please retrieve the sales figures for product A in year YYYY" to form the encoding $c_1$ with the type label "table", and the encoding set $C$ includes all bound encodings.
[0360] Beneficial effects of steps S4.1 to S4.8: This series of steps, through multi-type prompt templates and contrastive learning encoding, realizes the dynamic generation and semantic alignment optimization of candidate prompt words, and enhances the semantic matching ability between prompt words and user queries.
[0361] The process includes the following steps: Step S4.1 splits the triplet field to provide structured input for template filling; Step S4.2 generates diverse candidate prompt words using predefined multi-type templates to cover query scenarios from different data sources; Step S4.3 standardizes text format and filters special characters to improve subsequent encoding quality; Step S4.4 generates initial prompt word vectors using SimCSE-RoBERTa encoding to provide a basic representation for contrastive learning optimization; Step S4.5 establishes training objectives for contrastive learning by constructing positive and negative sample pairs to enhance semantic discrimination capabilities; Step S4.6 optimizes the vector space distribution using the InfoNCE loss layer to cluster semantically similar prompt words in the vector space; Step S4.7 performs L2 norm normalization to unify vector magnitudes for easier subsequent similarity calculations; and Step S4.8 binds vectors to text to establish a candidate prompt word encoding set.
[0362] Step S5: Perform double semantic alignment on the candidate prompt word encoding set to calculate multidimensional semantic similarity and credibility score, and obtain a comprehensive score candidate answer set.
[0363] Furthermore, step S5 specifically includes the following steps.
[0364] Step S5.1: Input the candidate prompt word encoding set into the large language model to trigger the retrieval of associated semantic blocks and the generation of answers, thereby obtaining the initial candidate answer set and the associated data source semantic block set.
[0365] Preferably, the core of this step is to automatically generate candidate answers by generating natural language answers based on candidate prompts using a large language model, while simultaneously recording the semantic blocks of the original data source referenced by the answer.
[0366] For large language models, pre-trained generative language models (such as the GPT series) are used, with a parameter count between 7B and 70B, supporting Chinese natural language generation.
[0367] Specifically, for the prompt word input format, the candidate prompt word text is concatenated with the associated original data semantic block to form a complete input, for example $x_i = [p_i ; ctx_i]$.
[0368] Here, $p_i$ is the i-th candidate prompt word and $ctx_i$ is the context fragment of the associated data source semantic block.
[0369] For the answer generation process, an autoregressive generation method is adopted, in which the model outputs the answer text token by token until an end symbol is generated or the maximum length is reached.
[0370] Specifically, for recording the associated data sources, the indexes of the data source semantic blocks referenced by the answer are recorded together with the generated answer, for example: $(a_i, \{d_{i,1}, d_{i,2}, \ldots\})$.
[0371] Here, $a_i$ is the i-th initial candidate answer and $d_{i,j}$ is the j-th data source semantic block associated with it.
[0372] For example, the candidate prompt "Please retrieve the sales amount of product A in year YYYY" is concatenated with the associated data source "sales data table of product A" and input into the large language model to generate the initial candidate answer "the sales amount of product A in year YYYY is 150 million yuan", while recording the semantic block index of the associated data source.
[0373] Step S5.2: Semantically encode the initial candidate answer set and the associated data source semantic block set to obtain the candidate answer semantic vector set and the data source semantic vector set.
[0374] Preferably, the core of this step is to complete the vectorized representation of candidate answers and data sources, and to extract deep semantic features through a pre-trained language model to provide vector input for subsequent similarity calculation.
[0375] For the encoding model, the same RoBERTa model as in step S3.2 is used to ensure the consistency of the vector space.
[0376] For candidate answer encoding, the initial candidate answer text is input into the model, and the hidden state at the [CLS] position is extracted as the semantic representation, as shown in the formula: $h_{a_i} = \mathrm{RoBERTa}(a_i)_{[\mathrm{CLS}]} \in \mathbb{R}^{d_h}$.
[0377] Here, $h_{a_i}$ is the semantic vector of the i-th candidate answer $a_i$ and $d_h$ is the hidden layer dimension.
[0378] For data source encoding, the semantic block text associated with the data source is input into the model, and semantic vectors are extracted using the following formula: $h_{d_{i,j}} = \mathrm{RoBERTa}(d_{i,j})_{[\mathrm{CLS}]}$.
[0379] Here, $h_{d_{i,j}}$ is the semantic vector of the j-th data source semantic block $d_{i,j}$ associated with the i-th answer.
[0380] Specifically, for vector dimensionality reduction and normalization, the same projection and L2 normalization operations as in step S3.2 are performed to ensure dimension alignment.
[0381] For example, the initial candidate answer "Product A's sales revenue in year YYYY was 150 million yuan" is encoded into a 768-dimensional vector, projected to 384 dimensions, and normalized to obtain the candidate answer semantic vector.
[0382] Step S5.3: Calculate the cosine similarity between the candidate answer semantic vector set and the data source semantic vector set to obtain a semantic similarity score set.
[0383] Preferably, the core of this step is to quantify the semantic consistency between candidate answers and data sources, and to measure the degree of semantic matching between the answer content and the original data through cosine similarity, thereby achieving the first level of semantic alignment.
[0384] For cosine similarity calculation, the formula is as follows: $\mathrm{sim}(\hat{a}_i, \hat{d}_{i,j}) = \dfrac{\hat{a}_i \cdot \hat{d}_{i,j}}{\lVert \hat{a}_i \rVert \, \lVert \hat{d}_{i,j} \rVert}$.
[0385] Here, $\hat{a}_i$ is the normalized candidate answer vector and $\hat{d}_{i,j}$ is the normalized data source vector.
[0386] For multi-data source aggregation, when a candidate answer is associated with semantic blocks from multiple data sources, the maximum similarity is taken as the semantic similarity score of that answer, as shown in the formula: $S_i = \max_j \mathrm{sim}(\hat{a}_i, \hat{d}_{i,j})$.
[0387] Here, $S_i$ is the semantic similarity score of the i-th candidate answer.
[0388] For example, if the cosine similarity between a candidate answer vector and its associated data source vector is 0.88, the answer content is highly consistent with the semantics of the original data.
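The max-aggregation over multiple data sources reduces to a single expression once the vectors are unit length. The sketch assumes the inputs are already L2-normalized, so the dot product equals the cosine similarity; the function name and the toy vectors are hypothetical.

```python
def similarity_score(answer_vec, source_vecs):
    """Maximum cosine similarity between one answer and its associated sources.

    Inputs are assumed to be L2-normalized, so dot product == cosine similarity.
    """
    return max(sum(a * s for a, s in zip(answer_vec, src)) for src in source_vecs)

print(similarity_score([1.0, 0.0], [[0.6, 0.8], [0.8, 0.6]]))
```

Taking the maximum means an answer is judged by its best-supporting source; an average would instead penalize answers that also cite weaker sources.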
[0389] Step S5.4: Perform statistical calculations on the structuring degree, update timeliness, and source authority of the semantic block set of the associated data source to obtain a set of data source credibility scores.
[0390] Preferably, the core of this step is to complete a multi-dimensional credibility assessment of the data source. By quantifying the structure, update timeliness, and authority of the data source, it provides a basis for the reliability of the data source for evaluating the quality of candidate answers.
[0391] The structuring degree score is assigned according to the modality of the data source, with a separate preset score for tabular, text-based, and image-based data sources.
[0392] The update timeliness score is calculated from the time difference between the most recent update time of the data source and the current time, and it decays as that difference grows.
[0393] Here, $\Delta t$ is the time difference (unit: days) and $\lambda$ is the time decay coefficient; the larger the time difference, the lower the score.
[0394] The source authority score is assigned according to the source type of the data source, with a separate preset score for official databases, internal documents, and publicly available external data.
[0395] The overall credibility score is calculated by weighting and fusing the scores from the three dimensions, using the following formula: $\mathrm{Cred} = w_1 S_{\mathrm{struct}} + w_2 S_{\mathrm{time}} + w_3 S_{\mathrm{auth}}$.
[0396] Here, $S_{\mathrm{struct}}$, $S_{\mathrm{time}}$, and $S_{\mathrm{auth}}$ are the structuring degree, update timeliness, and source authority scores, and $w_1$, $w_2$, $w_3$ are the weighting coefficients, satisfying $w_1 + w_2 + w_3 = 1$.
[0397] For example, if a candidate answer is associated with a tabular data source that was last updated 30 days ago and originates from an official database, the three dimension scores are calculated for that source and fused into its overall credibility score.
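A three-dimension credibility score of this kind could be sketched as below. Every number here is a placeholder assumption: the per-modality and per-source scores, the weights, the decay coefficient, and the exponential form of the time decay are chosen only to make the example runnable and are not values from this document.

```python
import math

# Placeholder scores (assumptions, not values from the source text)
STRUCTURE_SCORE = {"table": 1.0, "text": 0.8, "image": 0.6}
AUTHORITY_SCORE = {"official": 1.0, "internal": 0.8, "external": 0.6}

def credibility(modality, days_since_update, source_type,
                weights=(0.4, 0.3, 0.3), decay=0.01):
    """Weighted fusion of structure, timeliness and authority scores."""
    s_struct = STRUCTURE_SCORE[modality]
    s_time = math.exp(-decay * days_since_update)  # assumed exponential decay
    s_auth = AUTHORITY_SCORE[source_type]
    w1, w2, w3 = weights                           # assumed to sum to 1
    return w1 * s_struct + w2 * s_time + w3 * s_auth

fresh = credibility("table", 30, "official")
stale = credibility("table", 365, "official")
print(fresh > stale)
```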
[0398] Step S5.5: Perform linear weighted fusion of the semantic similarity score set and the data source credibility score set to obtain a comprehensive score set.
[0399] Preferably, the core of this step is to complete the multi-dimensional comprehensive scoring of candidate answers, and to achieve a balance between the two quality indicators by linearly weighting and fusing semantic similarity and data source credibility.
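[0400] The formula for calculating the overall score is as follows: $F_i = \alpha S_i + \beta C_i$.
[0401] Where $F_i$, $S_i$, and $C_i$ are the overall score, semantic similarity score, and data source credibility score of the i-th candidate answer, $\alpha$ is the semantic similarity weight, and $\beta$ is the credibility weight.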
[0402] The weight parameters can be adjusted dynamically according to the application scenario: increase the credibility weight when data accuracy is the priority, and increase the similarity weight when answer relevance is the priority.
[0403] For example, the semantic similarity score and the data source credibility score of a candidate answer are weighted and summed to obtain its overall score.
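As a worked illustration of the linear weighted fusion, the sketch below combines a similarity score of 0.88 (the value used in the earlier cosine similarity example) with an assumed credibility score of 0.90 under assumed weights of 0.6 and 0.4. The credibility value and the weights are not taken from the source; they are chosen here because they happen to reproduce the comprehensive score of 0.888 that appears in a later example.

```python
def comprehensive_score(similarity, credibility, w_sim=0.6, w_cred=0.4):
    """Linear weighted fusion of semantic similarity and data source credibility.

    The default weights are illustrative assumptions.
    """
    return w_sim * similarity + w_cred * credibility

# 0.6 * 0.88 + 0.4 * 0.90 = 0.528 + 0.360 = 0.888
score = comprehensive_score(0.88, 0.90)
```

Raising `w_cred` favours answers grounded in more reliable sources, while raising `w_sim` favours answers that track the source text more closely.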
[0404] Step S5.6: Bind the comprehensive score set with the initial candidate answer set to obtain the comprehensive score candidate answer set.
[0405] Preferably, the core of this step is to complete the associated storage of scores and answers, and to attach the comprehensive score to the corresponding candidate answer to form a set of comprehensive score candidate answers that can be sorted and filtered.
[0406] For the binding structure, a dictionary structure is used to store each scored candidate answer.
[0407] The i-th candidate answer with comprehensive score includes the answer text, the comprehensive score, and the associated data source.
[0408] For set representation, the set of candidate answers with comprehensive scores is represented as: $A = \{a_1, a_2, \ldots, a_n\}$.
[0409] Where $a_i$ is the i-th candidate answer with comprehensive score and n is the number of candidate answers.
[0410] For example, the initial candidate answer "Product A's sales revenue in year YYYY was 150 million yuan" is bound to a comprehensive score of 0.888 to form a scored candidate answer, and the set includes all candidate answers after scoring.
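One possible shape for the dictionary-based binding is sketched below. The key names `answer`, `score`, and `sources` and the function `bind_scores` are hypothetical; the source only states that each entry holds the answer text, the comprehensive score, and the associated data source.

```python
def bind_scores(answers, scores, sources):
    """Attach each comprehensive score and source list to its candidate answer."""
    if not (len(answers) == len(scores) == len(sources)):
        raise ValueError("answers, scores, and sources must have equal length")
    return [
        {"answer": text, "score": score, "sources": list(blocks)}
        for text, score, blocks in zip(answers, scores, sources)
    ]

scored = bind_scores(
    ["Product A's sales revenue in year YYYY was 150 million yuan"],
    [0.888],
    [["Product A sales data table"]],
)
```

Keeping the score and the sources alongside the text means later steps can sort, filter, and trace each answer without a separate lookup.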
[0411] Beneficial effects of steps S5.1 to S5.6: This series of steps achieves multi-dimensional quality assessment of candidate answers through dual semantic alignment and credibility scoring, ensuring that the answer content is consistent with both the user's intent and the original data, and effectively filtering high-quality answers.
[0412] The process includes the following steps: Step S5.1 generates candidate answers and associates them with the data source, establishing a source relationship between the answers and the original data; Step S5.2 generates answer vectors and data source vectors through semantic encoding, providing vector representation for similarity calculation; Step S5.3 calculates the cosine similarity between candidate answers and the data source, quantifying the degree of consistency between the answers and the original data; Step S5.4 evaluates the structured nature, update timeliness, and source authority of the data source through statistical calculations, generating a credibility score; Step S5.5 performs a linear weighted fusion of semantic similarity and credibility to achieve a multi-dimensional comprehensive score; and Step S5.6 binds the comprehensive score to the candidate answers, constructing a sortable and filterable set of comprehensive score candidate answers.
[0413] Step S6: Sort and filter the candidate answer set of comprehensive scoring to reconstruct the optimal prompt words and input them into the large language model to obtain the optimized AI question answering results.
[0414] Furthermore, step S6 specifically includes the following steps.
[0415] Step S6.1: Sort the candidate answer set of the comprehensive score in descending order to obtain the sorted candidate answer set.
[0416] Preferably, the core of this step is to sort the candidate answers by quality, reorganize the candidate answers from high to low according to the comprehensive score, and provide an ordered set for the extraction of the best candidate.
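[0417] The ranking rule is to sort the candidate answer set with comprehensive scores in descending order of overall score, as shown in the formula: $A_{\text{sorted}} = \operatorname{sort}_{\downarrow}(A; F)$.
[0418] Where $A$ is the candidate answer set with comprehensive scores, $F$ is the overall score used as the sort key, and $A_{\text{sorted}}$ is the sorted set of candidate answers.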
[0419] For stable ranking, when multiple candidate answers have the same score, they are sorted in ascending order by generation timestamp to ensure the determinism of the ranking result.
[0420] For example, the candidate answer set with comprehensive scores contains 3 answers with comprehensive scores of {0.888, 0.756, 0.923}. After sorting in descending order, the sorted sequence is {0.923, 0.888, 0.756}, corresponding to the answer index {3, 1, 2}.
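The descending sort with a timestamp tie-break can be sketched as follows, using the three scores from the example above. The field names `index`, `score`, and `timestamp` are assumptions for illustration, and the integer timestamps are made-up values.

```python
def sort_candidates(candidates):
    """Sort by score descending; break ties by generation timestamp ascending."""
    return sorted(candidates, key=lambda c: (-c["score"], c["timestamp"]))

candidates = [
    {"index": 1, "score": 0.888, "timestamp": 10},
    {"index": 2, "score": 0.756, "timestamp": 11},
    {"index": 3, "score": 0.923, "timestamp": 12},
]
ranked = sort_candidates(candidates)
order = [c["index"] for c in ranked]  # answer indices in ranked order
```

Using a composite key makes the result independent of the input order, which is what gives the ranking its determinism when scores are equal.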
[0421] Step S6.2: Extract the candidate answer with the highest score from the sorted candidate answer set as the optimal candidate answer.
[0422] Preferably, the core of this step is to extract the optimal answer by selecting the candidate answer with the highest comprehensive score from the first position of the sorted set as the final output candidate.
[0423] For optimal candidate extraction, the first element of the sorted set is taken, as shown in the formula: $a^{*} = A_{\text{sorted}}[1]$.
[0424] Where $A_{\text{sorted}}$ is the sorted set of candidate answers and $a^{*}$ is the optimal candidate answer.
[0425] For threshold verification, when the overall score of the optimal candidate answer is lower than a preset threshold, the response is marked as low confidence and a reprocessing procedure can be triggered.
[0426] For example, if the top answer in the sorted candidate answer set has a comprehensive score of 0.923, which is higher than the threshold of 0.6, it is extracted as the optimal candidate answer. It includes the answer text, score, and associated data source.
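A minimal sketch of the extraction and threshold check is given below. The default threshold of 0.6 follows the example above; the function name `select_best`, the `score` key, and the choice to treat an empty candidate list as low confidence are assumptions.

```python
def select_best(sorted_candidates, threshold=0.6):
    """Return (best_candidate, low_confidence) from an already sorted list."""
    if not sorted_candidates:
        return None, True  # nothing to answer with; treated here as low confidence
    best = sorted_candidates[0]
    return best, best["score"] < threshold

best, low_confidence = select_best([{"score": 0.923}, {"score": 0.888}])
```

A caller would typically branch on `low_confidence` to decide whether to proceed to prompt reconstruction or to trigger reprocessing.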
[0427] Step S6.3: Extract the semantic block of the associated data source and the structured query triple of the optimal candidate answer.
[0428] Preferably, the core of this step is to extract the source information of the optimal answer, obtain the semantic block of the original data source on which the answer is based and the structured representation of the user query, and provide data fragments for the reconstruction of prompt words.
[0429] For the extraction of associated data sources, the set of associated data source semantic blocks is extracted from the optimal candidate answer, for example $D^{*} = \{d_1, d_2, \ldots, d_k\}$.
[0430] Where $d_j$ is the j-th associated data source semantic block and k is the number of associated data sources.
[0431] Specifically, for structured triple extraction, the structured representation corresponding to the current query is taken from the triple T generated in step S3.8, for example $T = (E, I, P)$.
[0432] Where $E$ is the query entity, $I$ is the query intent, and $P$ is the set of query attributes.
[0433] For example, the data source associated with the optimal candidate answer is "the 3rd row of the Product A sales data table", and the structured triple is T=(Product A, query, {indicator: sales amount, time: YYYY year}).
[0434] Step S6.4: Concatenate and reconstruct the semantic block of the associated data source and the structured query triple to obtain the reconstruction prompt words.
[0435] Preferably, the core of this step is to dynamically reconstruct the prompt words, concatenating the original data fragments with structured query information to construct enhanced prompt words that include explicit data injection.
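[0436] For splicing template design, a fixed template format is used, such as "User question: {Q} [SEP] Reference data: {D} [SEP] Please answer the user question based on the reference data."
[0437] Where Q is the original user question and D is the concatenated text of the associated data source semantic blocks.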
[0438] For data source text concatenation, multiple related data source semantic blocks are concatenated in order of source priority, with the priority rule being: table type > text type > image type, and each data source is separated by a newline character.
[0439] Specifically, for the length truncation strategy, when the total length of the reconstructed prompt words exceeds the maximum input length of the model, data source segments are truncated in order from low to high priority to ensure that critical data is not lost.
[0440] For example, after concatenating the data source "Product A's annual sales in YYYY are 150 million yuan" with a triple, the following reconstruction prompt is generated: "User question: What is the annual sales of Product A in YYYY? [SEP] Reference data: Product A's annual sales in YYYY are 150 million yuan. [SEP] Please answer the user question based on the reference data."
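The template filling, priority ordering, and truncation described above can be sketched as follows. The template wording is taken from the example prompt; measuring length in characters rather than model tokens, dropping whole segments instead of cutting inside them, and the names `PRIORITY`, `TEMPLATE`, and `build_prompt` are simplifying assumptions.

```python
# Lower value means higher priority: table > text > image.
PRIORITY = {"table": 0, "text": 1, "image": 2}

TEMPLATE = (
    "User question: {question} [SEP] Reference data: {data} [SEP] "
    "Please answer the user question based on the reference data."
)

def build_prompt(question, blocks, max_len):
    """Build the prompt from (modality, text) blocks, dropping low-priority blocks
    one at a time until the prompt fits within max_len characters."""
    ordered = sorted(blocks, key=lambda block: PRIORITY[block[0]])
    while True:
        data = "\n".join(text for _, text in ordered)
        prompt = TEMPLATE.format(question=question, data=data)
        if len(prompt) <= max_len or not ordered:
            return prompt
        ordered.pop()  # remove the lowest-priority remaining block

prompt = build_prompt(
    "What is the annual sales of Product A in YYYY?",
    [("text", "Product A's annual sales in YYYY are 150 million yuan.")],
    max_len=4096,
)
```

Because Python's `sorted` is stable, blocks of the same modality keep their original relative order, so the caller controls ordering within a priority class.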
[0441] Step S6.5: Standardize the format and filter redundancy of the reconstruction prompt words to obtain standardized reconstruction prompt words.
[0442] Preferably, the core of this step is to complete the text preprocessing of the reconstructed prompt words, eliminate splicing traces through format standardization, and improve information density through redundancy filtering.
[0443] The format standardization operations include: uniformly replacing special delimiters with natural line breaks, compressing consecutive spaces into single spaces, replacing template placeholders with actual values, and checking and fixing quotation mark pairings.
[0444] The redundant filtering operations include: removing duplicate data source fragments, removing contextual information that is irrelevant to the user's question, and removing expired metadata tags (such as timestamps, filenames, etc.).
[0445] For example, after the reconstruction prompts are formatted, the [SEP] separator is replaced with a natural line break and extra spaces are compressed, yielding the standardized reconstruction prompt words.
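A simplified sketch of the format standardization and redundancy filtering is shown below. It covers only separator replacement, whitespace compression, and removal of exact duplicate lines; relevance filtering and metadata removal are omitted. The function name `normalize_prompt` and the demo string are assumptions.

```python
import re

def normalize_prompt(prompt):
    """Replace [SEP] with line breaks, compress spaces, and drop duplicate lines."""
    text = prompt.replace("[SEP]", "\n")
    seen = set()
    lines = []
    for line in text.split("\n"):
        line = re.sub(r"[ \t]+", " ", line).strip()
        if not line or line in seen:
            continue  # skip empty lines and exact repeats
        seen.add(line)
        lines.append(line)
    return "\n".join(lines)

raw = ("User question: Q?  [SEP] Reference data: D. [SEP] "
       "Reference data: D. [SEP] Please answer.")
standardized = normalize_prompt(raw)
```

Exact-match deduplication is the cheapest form of redundancy filtering; near-duplicate fragments would need a similarity check instead.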
[0446] Step S6.6: Input the standardized reconstructed prompt words into the large language model for autoregressive generation to obtain the optimized AI question-answering results.
[0447] Preferably, the core of this step is to generate the final answer by using a large language model to generate a natural language answer based on the reconstructed prompt words through autoregression, and explicitly injecting the original data to ensure that the answer is consistent with the data source.
[0448] The generation parameter settings include the temperature parameter (a lower temperature keeps generation stable), the top-p sampling parameter, the maximum generation length, and the repetition penalty parameter.
[0449] For the autoregressive generation process, at each step t the model predicts the probability distribution of the next token t+1 and selects the token with the highest probability as the output, as shown in the formula: $y_{t+1} = \arg\max_{w} P(w \mid y_{1:t}, X)$.
[0450] Where $P(\cdot \mid \cdot)$ is the conditional probability, $y_{1:t}$ are the tokens generated so far, and $X$ is the standardized reconstruction prompt words.
[0451] Generation terminates when the model outputs an end symbol or reaches the maximum generation length.
[0452] For example, after the standardized reconstructed prompt words are input into the large language model, the model autoregressively generates the response: "According to the sales data table, the sales revenue of product A in year YYYY was 150 million yuan."
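The highest-probability selection and the two termination conditions can be illustrated with a toy decoding loop. The lookup table below stands in for a real language model and is entirely made up; `greedy_decode`, `toy_next_token_probs`, and the `<eos>` marker are hypothetical names. Temperature, top-p, and repetition penalty are not modelled in this sketch.

```python
def greedy_decode(next_token_probs, prompt_tokens, max_new_tokens, eos="<eos>"):
    """Repeatedly pick the most probable next token until eos or the length limit."""
    generated = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(prompt_tokens + generated)
        token = max(probs, key=probs.get)  # argmax over the distribution
        if token == eos:
            break
        generated.append(token)
    return generated

# Toy stand-in for a model: the distribution depends only on the last token.
TOY_TABLE = {
    "sales?": {"150": 0.9, "yuan": 0.1},
    "150": {"million": 0.8, "<eos>": 0.2},
    "million": {"yuan": 0.7, "<eos>": 0.3},
    "yuan": {"<eos>": 0.95, "150": 0.05},
}

def toy_next_token_probs(context):
    return TOY_TABLE[context[-1]]

output = greedy_decode(toy_next_token_probs, ["Product", "sales?"], max_new_tokens=10)
```

In this run the loop stops on the end symbol after three tokens; with a smaller `max_new_tokens` it would stop on the length limit instead.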
[0453] Step S6.7: Format the optimized AI question-answering results and clean up special characters, and output the final optimized AI question-answering results.
[0454] Preferably, the core of this step is to complete the post-processing of the final answer, ensuring the standardization of the output through format regularization and improving readability through special character cleanup.
[0455] The format standardization operations include: standardizing paragraph line breaks, unifying punctuation marks to Chinese punctuation marks, standardizing numerical units (e.g., writing an amount such as "150 million yuan" in one uniform notation), and uniformly adding the prefix "according to..." at the beginning of the answer.
[0456] The special character cleanup includes: removing noisy characters during the generation process, removing Markdown format symbols, removing abnormal Unicode encodings, and fixing garbled characters.
[0457] The final output format uses structured JSON for storage.
[0458] The final output contains the final answer text along with its confidence level and data source.
[0459] For example, after the generated response is formatted, the output is: "According to the sales data table, the sales revenue of product A in year YYYY was 150 million yuan.", with a confidence level of 0.923 and the data source "product A sales data table".
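The special character cleanup and JSON output can be sketched as below. The JSON key names `answer`, `confidence`, and `data_source` are hypothetical, since the source does not preserve the exact structure, and the cleanup covers only a few Markdown symbols, non-printable characters, and repeated spaces.

```python
import json
import re

def clean_answer(text):
    """Strip some Markdown symbols, non-printable characters, and extra spaces."""
    text = re.sub(r"[*`#]+", "", text)
    text = "".join(ch for ch in text if ch.isprintable() or ch == "\n")
    return re.sub(r"[ \t]+", " ", text).strip()

def build_output(answer, confidence, data_source):
    """Serialize the final result as a JSON string (key names are assumptions)."""
    payload = {
        "answer": clean_answer(answer),
        "confidence": confidence,
        "data_source": data_source,
    }
    return json.dumps(payload, ensure_ascii=False)

result = build_output(
    "**According to the sales data table, the sales revenue of product A "
    "in year YYYY was 150 million yuan.**",
    0.923,
    "product A sales data table",
)
```

Passing `ensure_ascii=False` keeps Chinese text and punctuation readable in the serialized output rather than escaping it.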
[0460] Beneficial effects of steps S6.1 to S6.7: Through sorting, filtering, and prompt word reconstruction, this series of steps achieves accurate extraction of the optimal answer and explicit injection of raw data, guiding the large language model to generate accurate answers that are highly consistent with the data source and thus curbing AI hallucinations at the source.
[0461] The process involves the following steps: Step S6.1 sorts the candidate answers in descending order to quickly locate the optimal answer; Step S6.2 extracts semantic blocks from the associated data sources of the optimal candidate answer to provide raw data fragments for prompt word reconstruction; Step S6.3 concatenates the semantic blocks from the data sources with structured triples to build the content foundation for reconstructing the prompt words; Step S6.4 standardizes the format and filters redundancy to improve the standardization and information density of the prompt words; Step S6.5 generates attention masks and positional encoding to adapt to the input format requirements of the large language model; Step S6.6 generates accurate answers through autoregressive generation using the large language model; and Step S6.7 performs format regularization and special character cleanup to output the final optimized AI question-answering result.
[0462] In summary, the overall beneficial effects of steps S1 to S6 in this embodiment are as follows.
[0463] Through steps S1 to S6, this method forms a complete technical loop of multi-source data semantic alignment and AI question-answering optimization, effectively overcoming the problems that traditional RAG systems face in multi-source heterogeneous data scenarios: inconsistent semantic representation, low retrieval accuracy, lack of deep semantic consistency verification, and susceptibility to AI hallucination. In step S1, the method segments multi-source heterogeneous data into semantic units and uses Sentence-BERT, TaBERT, and CLIP encoders for vectorization encoding to construct a unified semantic vector space, addressing the cross-modal semantic gap between text, tables, and images. In step S2, a hierarchical semantic index tree structure is built using two-layer K-means clustering, providing a data foundation for efficient retrieval of large-scale semantic vectors. Step S3 responds to user queries, using the RoBERTa model for intent recognition and entity extraction and transforming natural language queries into structured query triples so that user needs are understood accurately. Step S4 fills the structured triples into multi-type prompt templates to generate candidate prompt words, and enhances semantic alignment capabilities through SimCSE-RoBERTa contrastive learning. Step S5 ensures consistency between candidate answers and both user intent and data source through a dual semantic alignment mechanism, combined with credibility scoring to filter high-quality candidate answers. Finally, step S6 reconstructs the optimal prompt words and drives a large language model to generate accurate and reliable answers, achieving end-to-end optimization from data collection, index construction, intent understanding, prompt word generation, and quality filtering to the final answer.
[0464] In step S1, the text is segmented into units using a sliding window and semantic coherence detection, tables are segmented into key-value structures by row and logical partition, and visual element recognition generates a structured description of each image. The three kinds of semantic blocks are encoded and mapped to a 768-dimensional unified space, eliminating modal differences. Step S2 eliminates the influence of vector magnitude through L2 norm normalization, determines the number of cluster centers by taking the square root of the total number of vectors, and constructs a two-layer index structure from coarse clustering to sub-clustering. During retrieval, cosine similarity is matched layer by layer, reducing the retrieval complexity from O(N) to O(logN). Step S3 predicts the intent category and detects entity boundaries using RoBERTa sequence labeling branches, generates triples, and simultaneously matches the user's question vector with the index tree layer by layer to obtain Top-K candidate semantic blocks for verification. Step S4 generates initial candidate prompt words using three types of prompt templates (text, table, and image), and then uses the SimCSE-RoBERTa InfoNCE loss layer for contrastive learning optimization. Step S5 calculates the semantic similarity between the candidate answer and the user's question, as well as the consistency score with the data source, and integrates credibility indicators such as the degree of structuring, update timeliness, and source authority. Step S6 sorts the candidates according to the comprehensive score, selects the best candidate, reconstructs the prompt words, and explicitly injects the original data fragments to guide the large language model to generate accurate answers that are highly consistent with the data source.
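The layer-by-layer matching over a two-layer index can be sketched as follows. The tree is hand-built here rather than produced by K-means, all centers and stored vectors are assumed to be L2-normalized already (so a dot product equals cosine similarity), and rounding the square root upward in `num_clusters` is an assumption, since the source does not state a rounding rule. All names and the nested-dictionary layout are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def num_clusters(n_vectors):
    """Cluster count as the square root of the vector count, rounded up (assumed)."""
    return max(1, math.ceil(math.sqrt(n_vectors)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def search(tree, query, top_k=1):
    """Pick the best coarse cluster, then the best sub-cluster, then rank its items."""
    q = l2_normalize(query)
    coarse = max(tree, key=lambda node: dot(q, node["center"]))
    sub = max(coarse["children"], key=lambda node: dot(q, node["center"]))
    ranked = sorted(sub["items"], key=lambda item: dot(q, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:top_k]]

# Tiny hand-built two-layer tree with 2-D vectors for illustration.
tree = [
    {"center": [1.0, 0.0], "children": [
        {"center": l2_normalize([1.0, 0.1]),
         "items": [("a", l2_normalize([1.0, 0.05])), ("b", l2_normalize([1.0, 0.2]))]},
        {"center": l2_normalize([1.0, -0.3]),
         "items": [("c", l2_normalize([1.0, -0.3]))]},
    ]},
    {"center": [0.0, 1.0], "children": [
        {"center": [0.0, 1.0], "items": [("d", [0.0, 1.0])]},
    ]},
]
hits = search(tree, [2.0, 0.1])
```

Only one coarse cluster and one sub-cluster are visited per query in this sketch, which is where the saving over a full scan comes from; a production index might probe several clusters per layer to reduce missed matches.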
[0465] As shown in Figure 2, this embodiment provides an AI question-answering accuracy optimization device, which is applied to the AI question-answering accuracy optimization method described in the above embodiment.
[0466] Specifically, the AI question-answering accuracy optimization device includes a multi-source heterogeneous data segmentation module 1, a hierarchical semantic index tree structure construction module 2, a user question recognition and extraction module 3, a candidate prompt word encoding set generation module 4, a candidate prompt word encoding semantic alignment module 5, and an AI question-answering result optimization module 6, which are electrically or communicatively connected in sequence.
[0467] The multi-source heterogeneous data segmentation module 1 is used to segment multi-source heterogeneous data into semantic units and then perform vectorization encoding through Sentence-BERT, TaBERT, and CLIP encoders to obtain a multi-source semantic vector set. The hierarchical semantic index tree structure construction module 2 is used to construct a hierarchical semantic index tree structure from the multi-source semantic vector set through two-layer K-means clustering. The user question recognition and extraction module 3 is used to respond to external user questions, combine the hierarchical semantic index tree structure with the user question, and perform intent recognition and entity extraction through the RoBERTa model to obtain structured query triples. The candidate prompt word encoding set generation module 4 is used to fill the structured query triples into multi-type prompt templates to generate candidate prompt words, and to perform contrastive learning encoding through SimCSE-RoBERTa to obtain the candidate prompt word encoding set. The candidate prompt word encoding semantic alignment module 5 is used to perform dual semantic alignment on the candidate prompt word encoding set to calculate multi-dimensional semantic similarity and credibility scores and obtain a comprehensive score candidate answer set. The AI question-answering result optimization module 6 is used to sort and filter the comprehensive score candidate answer set to reconstruct the optimal prompt words and input them into the large language model to obtain the optimized AI question-answering result.
[0468] As shown in Figure 3, the electronic device 7 includes a processor 71 and a memory 72 coupled to the processor 71.
[0469] The memory 72 stores program instructions for implementing the AI question-answering accuracy optimization method of any of the above embodiments.
[0470] The processor 71 is used to execute program instructions stored in the memory 72 to perform AI question-answering accuracy optimization.
[0471] The processor 71 can also be referred to as a CPU (Central Processing Unit). The processor 71 may be an integrated circuit chip with signal processing capabilities. The processor 71 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor can be a microprocessor or any conventional processor.
[0472] Furthermore, Figure 4 is a schematic diagram of the structure of a storage medium according to an embodiment of this application. Referring to Figure 4, the storage medium 8 in this embodiment stores program instructions 81 capable of implementing all the above methods. These program instructions 81 can be stored in the storage medium as a software product, including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods in each embodiment of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks, or terminal devices such as computers, servers, mobile phones, and tablets.
[0473] In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the system embodiments described above are merely illustrative. For instance, the division into units is only a division by logical function, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, systems, or units, and may be electrical, mechanical, or in other forms.
[0474] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated units described above can be implemented in hardware or as software functional units. The above are merely embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made based on the description and drawings of this application, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. A method for optimizing the accuracy of AI question answering, characterized in that the AI question-answering accuracy optimization method includes: Step S1: Segment multi-source heterogeneous data into semantic units, and perform vectorization encoding using Sentence-BERT, TaBERT, and CLIP encoders respectively to obtain a multi-source semantic vector set; Step S2: Construct a hierarchical semantic index tree structure from the multi-source semantic vector set through two-layer K-means clustering; Step S3: In response to an external user's question, combine the hierarchical semantic index tree structure with the user's question, and perform intent recognition and entity extraction using the RoBERTa model to obtain a structured query triple; Step S4: Fill the structured query triples into the multi-type prompt template to generate candidate prompt words, and perform contrastive learning encoding through SimCSE-RoBERTa to obtain a candidate prompt word encoding set; Step S5: Perform dual semantic alignment on the candidate prompt word encoding set to calculate multidimensional semantic similarity and credibility score, and obtain a comprehensive score candidate answer set; Step S6: Sort and filter the comprehensive score candidate answer set to reconstruct the optimal prompt words and input them into the large language model to obtain the optimized AI question answering results.
2. The AI question-answering accuracy optimization method according to claim 1, characterized in that, Step S1 involves segmenting the multi-source heterogeneous data into semantic units and then vectorizing them using Sentence-BERT, TaBERT, and CLIP encoders to obtain a multi-source semantic vector set, including: Step S1.1: Collect multi-source heterogeneous raw data and perform format normalization processing to obtain a standardized multi-source dataset; Step S1.2: Modal splitting is performed on the standardized multi-source dataset to obtain text dataset, tabular dataset, and image dataset; Step S1.3: The text dataset is segmented into units by using a sliding window combined with semantic coherence detection to obtain a set of text semantic blocks; Step S1.4: Perform key-value structured segmentation on the table dataset by row and logical partition to obtain a set of table semantic blocks; Step S1.5: Perform visual element recognition and structured description generation on the image dataset to obtain an image semantic block set; Step S1.6: The set of text semantic blocks is vectorized and encoded using the Sentence-BERT pre-trained model to obtain a set of text semantic vectors; Step S1.7: The set of table semantic blocks is vectorized and encoded using a TaBERT pre-trained model to obtain a set of table semantic vectors; Step S1.8: The image semantic block set is vectorized and encoded using the CLIP visual encoder to obtain an image semantic vector set; Step S1.9: Merge the text semantic vector set, the table semantic vector set, and the image semantic vector set to obtain a multi-source semantic vector set.
3. The AI question-answering accuracy optimization method according to claim 1, characterized in that, Step S2 involves constructing a hierarchical semantic index tree structure from the multi-source semantic vector set using two-layer K-means clustering, including: Step S2.1: Perform L2 norm global normalization on the multi-source semantic vector set to obtain a normalized multi-source semantic vector set; Step S2.2: Count the total number of vectors in the normalized multi-source semantic vector set, and calculate the number of first-layer cluster centers by taking the arithmetic square root of the total number of vectors; Step S2.3: Input the normalized multi-source semantic vector set and the number of cluster centers in the first layer into the K-means clustering algorithm to obtain the first layer cluster center vector set and the first layer intra-cluster vector grouping set; Step S2.4: Count the number of group vectors for each group in the first layer intra-cluster vector grouping set, and calculate the number of second-layer cluster centers for each group by taking the arithmetic square root of the number of group vectors; Step S2.5: Input the first layer intra-cluster vector grouping set and the second layer cluster center number into the K-means clustering algorithm to obtain the second layer cluster center vector set and the second layer intra-cluster vector grouping set; Step S2.6: Perform hierarchical association binding between the first layer cluster center vector set and the second layer cluster center vector set to obtain a hierarchical cluster center association structure; Step S2.7: Bind the second layer intra-cluster vector grouping set to the hierarchical cluster center association structure using vector pointers to obtain a set of hierarchical tree nodes with vector indexes; Step S2.8: Perform topological sorting and structural solidification on the hierarchical tree node set to obtain the hierarchical semantic index tree structure.
4. The AI question-answering accuracy optimization method according to claim 1, characterized in that, Step S3: In response to an external user query, the hierarchical semantic index tree structure is combined with the user query, and intent recognition and entity extraction are performed using the RoBERTa model to obtain structured query triples, including: Step S3.1: In response to an external user's question, perform text standardization and noise character filtering on the externally input user question to obtain the standardized user question text; Step S3.2: Semantically encode the standardized user question text using the RoBERTa pre-trained model to obtain the user question semantic vector; Step S3.3: Perform layer-by-layer cosine similarity matching between the user query semantic vector and the hierarchical semantic index tree structure to obtain a Top-K candidate semantic block set; Step S3.4: Input the standardized user question text into the RoBERTa sequence labeling branch to perform intent category label prediction and obtain the user question intent label; Step S3.5: Input the standardized user question text into the RoBERTa entity recognition branch to perform named entity boundary detection and type labeling to obtain the user question entity set; Step S3.6: Perform field alignment mapping between the user's question intent tag and the user's question entity set to obtain a set of intent-entity association pairs; Step S3.7: Perform semantic association verification between the intent-entity association pair set and the Top-K candidate semantic block set to obtain the verified intent-entity association pair set; Step S3.8: The verified intent-entity association set is restructured in a structured manner to obtain a structured query triplet.
5. The AI question-answering accuracy optimization method according to claim 1, characterized in that, Step S4: Fill the structured query triples into the multi-type prompt template to generate candidate prompt words, and perform contrastive learning encoding through SimCSE-RoBERTa to obtain a candidate prompt word encoding set, including: Step S4.1: Split the structured query triples into fields to obtain entity fields, intent fields, and attribute fields; Step S4.2: Fill the entity field, intent field, and attribute field into the predefined text, table, and image prompt templates respectively to obtain the initial candidate prompt word set; Step S4.3: Standardize the text format and filter special characters of the initial candidate prompt word set to obtain a standardized candidate prompt word set; Step S4.4: The standardized candidate prompt word set is initially semantically encoded using the SimCSE-RoBERTa pre-trained model to obtain an initial prompt word vector set; Step S4.5: Construct semantically similar positive sample pairs and a global negative sample queue for the initial prompt word vector set to obtain a contrastive learning sample set; Step S4.6: Input the contrastive learning sample set into the InfoNCE loss layer of SimCSE-RoBERTa to perform vector space alignment optimization, and obtain the optimized prompt word vector set; Step S4.7: Perform L2 norm global normalization on the optimized prompt word vector set to obtain a normalized prompt word vector set; Step S4.8: Bind the normalized prompt word vector set to the standardized candidate prompt word set one by one to obtain the candidate prompt word encoding set.
6. The AI question-answering accuracy optimization method according to claim 1, characterized in that, Step S5: Perform dual semantic alignment on the candidate prompt word encoding set to calculate multidimensional semantic similarity and credibility scores, obtaining a comprehensive score candidate answer set, including: Step S5.1: Input the standardized candidate prompt words corresponding to the candidate prompt word encoding set into the large language model to trigger the retrieval of associated semantic blocks and the generation of answers, thereby obtaining the initial candidate answer set and the associated data source semantic block set; Step S5.2: Semantically encode the initial candidate answer set and the associated data source semantic block set to obtain the candidate answer semantic vector set and the data source semantic vector set; Step S5.3: Calculate the cosine similarity between the candidate answer semantic vector set and the data source semantic vector set to obtain a semantic similarity score set; Step S5.4: Perform statistical calculations on the structuring degree, update timeliness, and source authority of the semantic block set of the associated data source to obtain a set of data source credibility scores; Step S5.5: Perform linear weighted fusion of the semantic similarity score set and the data source credibility score set to obtain a comprehensive score set of candidate answers; Step S5.6: Bind the comprehensive score set of candidate answers to the initial candidate answer set one by one to obtain the comprehensive score candidate answer set.
7. The AI question-answering accuracy optimization method according to claim 1, characterized in that step S6, sorting and filtering the comprehensive score candidate answer set to reconstruct the optimal prompt words and inputting them into the large language model to obtain the optimized AI question-answering result, includes: Step S6.1: Sort the comprehensive score candidate answer set by comprehensive score from high to low to obtain a sorted candidate answer set; Step S6.2: Extract the associated data source semantic block corresponding to the optimal candidate answer in the sorted candidate answer set to obtain an optimal data source semantic block; Step S6.3: Concatenate the optimal data source semantic block with the structured query triple to obtain the basic content of the reconstructed prompt words; Step S6.4: Standardize the format of the basic content of the reconstructed prompt words and filter out redundant content to obtain standardized reconstructed prompt words; Step S6.5: Input the standardized reconstructed prompt words into the context input layer of the large language model, and perform attention mask generation and position encoding to obtain the final input sequence of the model; Step S6.6: Input the final input sequence of the model into the large language model for autoregressive generation to obtain an initial generated answer text; Step S6.7: Standardize the format of the initial generated answer text and remove special characters to obtain the optimized AI question-answering result.
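Steps S6.1 to S6.4 of claim 7 can be illustrated with a short sketch, under stated assumptions. The prompt layout (`Context:`, `Entity:`, `Intent:`, `Attribute:`), the choice to treat duplicate and empty lines as the redundant content, and the function names are hypothetical; the claim does not define them. Steps S6.5 to S6.7 depend on the specific large language model and are not shown.

```python
import re


def rank_answers(scored_answers):
    """S6.1: sort (answer, score) pairs by score from high to low."""
    return sorted(scored_answers, key=lambda pair: pair[1], reverse=True)


def rebuild_prompt(scored_answers, source_blocks, triple):
    """S6.2-S6.4: build a standardized prompt from the best answer's source block.

    source_blocks maps each answer to its associated data source semantic block.
    """
    if not scored_answers:
        raise ValueError("no candidate answers to rank")
    best_answer, _ = rank_answers(scored_answers)[0]
    best_block = source_blocks[best_answer]              # S6.2
    entity, intent, attribute = triple
    raw = (f"Context: {best_block}\n"                    # S6.3: concatenate
           f"Entity: {entity}\n"
           f"Intent: {intent}\n"
           f"Attribute: {attribute}")
    # S6.4: collapse runs of spaces and tabs, drop empty and duplicate lines.
    seen, lines = set(), []
    for line in raw.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line and line not in seen:
            seen.add(line)
            lines.append(line)
    return "\n".join(lines)
```

Grounding the rebuilt prompt in the single highest-scoring source block is one reading of step S6.2; a variant that keeps several top-ranked blocks would follow the same pattern.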
8. An AI question-answering accuracy optimization device, wherein the AI question-answering accuracy optimization device is applied to the AI question-answering accuracy optimization method as described in any one of claims 1 to 7, characterized in that the AI question-answering accuracy optimization device includes: a multi-source heterogeneous data segmentation module, used to segment multi-source heterogeneous data into semantic units and then perform vectorization encoding through Sentence-BERT, TaBERT, and CLIP encoders respectively to obtain a multi-source semantic vector set; a hierarchical semantic index tree structure construction module, used to construct a hierarchical semantic index tree structure from the multi-source semantic vector set through two-layer K-means clustering; a user question identification and extraction module, used to respond to external user questions by combining the hierarchical semantic index tree structure with the user question and performing intent recognition and entity extraction through the RoBERTa model to obtain structured query triples; a candidate prompt word encoding set generation module, used to fill the structured query triples into a multi-type prompt template to generate candidate prompt words and to perform contrastive learning encoding through SimCSE-RoBERTa to obtain a candidate prompt word encoding set; a candidate prompt word encoding semantic alignment module, used to perform double semantic alignment on the candidate prompt word encoding set to calculate multi-dimensional semantic similarity and credibility scores and obtain a comprehensive score candidate answer set; and an AI question-answering result optimization module, used to sort and filter the comprehensive score candidate answer set, reconstruct the optimal prompt words, and input them into the large language model to obtain the optimized AI question-answering result.
9. An electronic device, characterized in that the electronic device includes a processor and a memory coupled to the processor, the memory storing program instructions executable by the processor; when the processor executes the program instructions stored in the memory, it implements the AI question-answering accuracy optimization method as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program instructions that, when executed by a processor, implement the AI question-answering accuracy optimization method as described in any one of claims 1 to 7.